Apache Spark is a “red hot” open-source data analytics cluster computing framework, originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source ecosystem, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to iterative workloads such as machine learning algorithms.
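To make that concrete, here is a minimal sketch of the in-memory model, written for the Spark shell (where a SparkContext is already available as sc). The HDFS path and log contents are hypothetical placeholders, not taken from any real deployment.

```scala
// Lazily define a dataset from HDFS (hypothetical path).
val logs = sc.textFile("hdfs:///data/server.log")

// Derive a filtered dataset and ask Spark to keep it in cluster memory.
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()

// Repeated queries over `errors` now hit memory instead of
// re-reading and re-filtering the file from disk each time.
val totalErrors = errors.count()
val timeoutErrors = errors.filter(_.contains("timeout")).count()
```

The first action (count) materializes the cached dataset; every query after that reuses it, which is exactly what makes iterative algorithms fast on Spark.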
These videos give a nice introduction to Spark, how it’s being used, and why you should care…
Spark in the Hadoop Ecosystem – Eric Baldeschwieler, CTO Hortonworks
Beyond Hadoop MapReduce: Interactive Analytic Insights Using Spark
ClearStory Data’s Sharmila Mulligan and Stephanie McReynolds discuss the applications of interactive big data technologies.
Parallel Programming with Spark (Part 1 & 2) – Matei Zaharia, Founder of Spark
Part 1: A brief intro to Scala and exploring data in the Spark Shell. Part 2: Writing standalone Spark programs using Scala or Java.
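In the spirit of Part 2, a standalone Spark program in Scala looks roughly like the sketch below. The application name, object name, and HDFS paths are illustrative assumptions, not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal standalone Spark application (hypothetical word count).
object WordCount {
  def main(args: Array[String]): Unit = {
    // In a standalone program you build the SparkContext yourself,
    // unlike in the shell where `sc` is provided.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/input.txt")  // placeholder path
      .flatMap(_.split("\\s+"))        // split lines into words
      .map(word => (word, 1))          // pair each word with a count of 1
      .reduceByKey(_ + _)              // sum counts per word across the cluster

    counts.saveAsTextFile("hdfs:///data/word-counts")   // placeholder path
    sc.stop()
  }
}
```

The same code runs whether it is written in Scala or translated to the equivalent Java API; only the construction of the SparkContext and the lambda syntax differ.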
Strata 2014: Matei Zaharia, “How Companies are Using Spark, and Where the Edge in Big Data Will Be”
While the first big data systems made a new class of applications possible, organizations must now compete on the speed and sophistication with which they can draw value from data. Future data processing platforms will need not just to scale cost-effectively, but also to allow ever more real-time analysis and to support both simple queries and today’s most sophisticated analytics algorithms. Through the Spark project at Apache and Berkeley, we’ve brought six years of research to enable real-time and complex analytics within the Hadoop stack.