An Introduction to Big Data Application Development & MapReduce
By David Loshin | Data Informed – June 7, 2013

Editor's note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.
One of the key objectives of using a multi-processing node environment is to speed application execution by breaking large "chunks" of work into much smaller ones that can be farmed out to a pool of available processing nodes. As long as no dependency forces any one task to wait for another to finish, these smaller tasks can be executed at the same time. This is the essence of what is called "task parallelism."
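The idea can be sketched in a few lines. This is a minimal illustration, not part of the original article: a set of independent tasks (here, a trivial `square` function chosen purely for demonstration) is farmed out to a pool of worker processes, and because no task depends on another's result, they can all run concurrently.

```python
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # An independent unit of work: no task waits on another's result.
    return n * n

def run_in_parallel(values):
    # Farm the independent tasks out to a pool of available worker
    # processes; the pool schedules them concurrently.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(square, values))

if __name__ == "__main__":
    print(run_in_parallel(range(5)))  # [0, 1, 4, 9, 16]
```

The same structure applies whatever the unit of work is; the only requirement is the independence the paragraph above describes.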
As an example, consider a telecommunications company that would like to market bundled mobile telecommunication services to a particular segment of households in which high school age children are transitioning to college and might be targeted for additional services at their future college locations. Part of the analysis involves looking for certain kinds of patterns among collections of call detail records among household members for households that fit the target model of the marketing campaign. The next step would be to look for other households that are transitioning into the target model and determine their suitability for the program. This is a good example of a big data analysis application that needs to scan millions, if not billions, of call detail records to look for and then match against different sets of patterns. The collections of call detail records can be "batched" into smaller sets and analyzed in parallel for intermediate results, which can later be aggregated to provide the desired insight.
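The batch-then-aggregate pattern just described can be sketched as follows. Note that the record layout here (a `(household_id, area_code)` pair) and the per-household call count are simplifying assumptions for illustration; the article does not specify the actual record format or the pattern being matched. What the sketch shows is the shape of the computation: split the records into batches, scan each batch in parallel for an intermediate result, then merge the intermediate results.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

# Hypothetical call-detail records: (household_id, callee_area_code).
# The batching-and-aggregation pattern, not the record layout, is the point.

def batches(records, size):
    # Split the full record set into smaller batches for parallel scanning.
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

def scan_batch(batch):
    # "Map" step: produce an intermediate result for one batch
    # (here, a simple call count per household).
    return Counter(household for household, _area in batch)

def aggregate(partials):
    # "Reduce" step: merge the per-batch intermediate results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def analyze(records, batch_size=1000):
    # Scan the batches in parallel, then aggregate the intermediates.
    with ProcessPoolExecutor() as pool:
        return aggregate(pool.map(scan_batch, batches(records, batch_size)))
```

In a real deployment the batching, distribution, and aggregation would be handled by a framework such as MapReduce rather than written by hand, but the division of labor is the same.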
This leads to one misconception of the big data phenomenon: the heightened expectation of easily achievable, scalable high performance resulting from automated task parallelism. One would expect this example application to run significantly faster over larger volumes of records when deployed in a big data environment, and that is the idea that has inspired such great interest in the power of big data analytics. And yet, it is not so simple to achieve these performance speed-ups. In general, one cannot assume that any arbitrarily chosen business application can be migrated to a big data platform, recompiled, and magically scale up in both execution speed and support for massive data volumes. Having determined that the business challenge is suited to a big data solution, the programmers have to envision a method by which the problem can be solved, and they have to design and develop the algorithms for making it happen.