
Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. As I wrote in my article on Hadoop, big data sets required a better way of processing because traditional RDBMS simply can't cope, and Hadoop revolutionized the industry by making it possible to process these data sets using horizontally scalable clusters of commodity hardware. However, Hadoop's own compute engine, MapReduce, is limited in two ways: it offers a single programming model built around Mappers and Reducers, and it is tied to reading and writing data to the filesystem between processing stages, which slows the whole job down.
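To make the Mapper/Reducer programming model concrete, here is a minimal pure-Python sketch of a MapReduce-style word count. This runs in-process with no Hadoop involved; the function names (`mapper`, `shuffle`, `reducer`) are illustrative, not part of any real API, but the three phases mirror what a MapReduce job does.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce phase: combine the values for one key (here, sum the counts)
    return (key, sum(values))

def word_count(lines):
    # Wire the three phases together over all input lines
    pairs = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in grouped.items())

counts = word_count(["spark and hadoop", "spark is fast"])
print(counts["spark"])  # 2
```

Every computation has to be forced into this map-shuffle-reduce shape, and in real Hadoop each phase spills its intermediate results to disk. Spark's appeal is that the same computation is a short chain of transformations over a distributed dataset, with intermediate results kept in memory between stages.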

Hadoop

Traditionally, data analysis has been done on Relational Database Management Systems (RDBMS), which work on data with a clearly defined structure since they require a schema definition before the data can be loaded. RDBMS also scale vertically rather than horizontally: capacity is added by moving to a higher-capacity machine rather than by spreading the load across many machines, because replicating RDBMS data across machines tends to be problematic.