Apache Spark is a fast and general engine for large-scale data processing. As I wrote in my article on Hadoop, big data sets required a better way of processing because traditional RDBMS simply can’t cope, and Hadoop revolutionized the industry by making it possible to process these data sets using horizontally scalable clusters of commodity hardware.
However, Hadoop’s own compute engine, MapReduce, is limited: it offers a single programming model of Mappers and Reducers, and it reads and writes intermediate data to the filesystem between processing steps, which slows jobs down.
Traditionally, data analysis has been done on Relational Database Management Systems (RDBMS), which work on data with a clearly defined structure, since they require a schema definition before the data can be loaded.
RDBMS also scale vertically rather than horizontally: capacity is added by moving to higher-capacity machines rather than by spreading the load across many machines, since replicating RDBMS data across machines tends to be problematic.
Following on from my post on setting up a platform to get started with data science tools, I have since set up a Jupyter-based platform for programming Python on Spark.
On top of using Python libraries (like pandas, NumPy, Scikit-Learn, etc.) that make data analysis easier, on this platform I can also use Spark to write applications that run on distributed clusters.
This setup has the following benefits:

- It is web based, so I can work on my projects from anywhere as long as I have a web browser and an internet connection.
- It is set up using lightweight EC2 instance types (t2.
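On this platform I can combine the notebook-friendly Python stack with Spark in the same session. A minimal sketch, where the application name and the DataFrame contents are illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session from the notebook
spark = SparkSession.builder.appName("notebook").getOrCreate()

# Build a small pandas DataFrame locally...
pdf = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

# ...then hand it to Spark to work with it as a distributed DataFrame
sdf = spark.createDataFrame(pdf)
sdf.groupBy().avg("y").show()
```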
I was recently asked to solve a data science challenge for a job application: write a Spark application that determines the top 10 most-rated genres among TV series with more than 10 episodes. The challenge required the solution to be written in Scala using SBT.
I later rewrote the solution in Python, which I am more comfortable with; here are my notes on the Python solution.
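To give a sense of the shape of the Python version, here is a minimal PySpark sketch of that kind of aggregation, reading “most rated” as the genres with the most ratings. The input path and the column names (genre, num_episodes, rating) are assumptions for illustration, not the challenge’s actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TopGenres").getOrCreate()

# Hypothetical input file and columns; the real challenge data differs
series = spark.read.csv("tv_series.csv", header=True, inferSchema=True)

top_genres = (
    series
    .filter(F.col("num_episodes") > 10)           # series with over 10 episodes
    .groupBy("genre")
    .agg(F.count("rating").alias("num_ratings"))  # how often each genre is rated
    .orderBy(F.desc("num_ratings"))
    .limit(10)                                    # top 10 genres
)

top_genres.show()
```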
NumPy

NumPy is the core library for scientific computing in Python. It adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Installation

The easiest way to get NumPy installed is to install one of the Python distributions, like Anaconda, which include all of the key packages for the scientific computing stack.
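Once installed, a quick sanity check from Python confirms that the package imports and shows which version you got:

```python
import numpy

# Print the installed NumPy version to confirm the installation worked
print(numpy.__version__)
```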
Usage

To start using NumPy you need to import it:
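A minimal example, using the conventional np alias and a small array to confirm that the import works:

```python
import numpy as np

# A small one-dimensional array
a = np.array([1, 2, 3])

print(a.shape)   # (3,)
print(a.mean())  # 2.0
```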