This guide is for quickly setting up a local Kafka environment on a single machine. A properly configured Kafka cluster is laborious to set up and has dependencies such as Apache ZooKeeper, which can be off-putting if you just want to develop locally or put together a proof-of-concept prototype.
This is obviously not the way to set up a production Kafka cluster, as it has no security, redundancy, fault tolerance, scaling, or the host of other requirements of a production-grade environment.
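Once the local broker is up and listening on its default port, a quick smoke test from Python can confirm it works end to end. This is a minimal sketch using the kafka-python client; the broker address and topic name are assumptions for a default single-machine setup.

```python
# Minimal smoke test for a local Kafka broker (assumes kafka-python is
# installed and a broker is listening on localhost:9092).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from the local broker")
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read the topic from the beginning
    consumer_timeout_ms=5000,       # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```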
For many years I have been yearning for a single operating system that I run for all my work and leisure needs, but I have always had to run Linux for my software development tasks and Microsoft Windows for general usage.
Linux works well for me for programming work because of the powerful bash shell, the Linux command-line utilities (like git, grep, cat, etc.), how well they integrate with the shell, and a host of development tools that just work so naturally on Linux, like Python and Node.
Following on from my post on setting up a platform to get started with data science tools, I have since set up a Jupyter-based platform for programming Python on Spark.
On top of using Python libraries (like pandas, NumPy, Scikit-Learn, etc.) that make data analysis easier, on this platform I can also use Spark to write applications that run on distributed clusters.
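As a rough illustration of what working on this platform looks like, here is a minimal sketch of starting a SparkSession from a Jupyter notebook cell; it assumes pyspark is installed in the notebook's Python environment, and the local master URL would be swapped for a cluster URL when running distributed.

```python
# Minimal sketch: start a SparkSession from a notebook cell and run a toy query.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("notebook-sketch")
         .master("local[*]")   # swap for a cluster URL when running distributed
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```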
This setup has the following benefits:
It is web-based, so I can work on my projects from anywhere as long as I have a web browser and an internet connection. It is set up using lightweight EC2 instance types (t2.
I was recently asked to solve a data-science-related challenge for a job application. The challenge was to write a Spark application that determines the top 10 most-rated genres of TV series with more than 10 episodes, and it required the solution to be written in Scala using the SBT tool.
I later rewrote the solution in Python, which I am more comfortable with; here are my notes on the Python solution.
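To give a flavour of the Python approach, here is a rough PySpark sketch of the aggregation. The input path and column names (episodes, genre, num_ratings) are assumptions for illustration, not the actual dataset schema from the challenge.

```python
# Sketch: keep series with more than 10 episodes, then rank genres by
# total number of ratings and take the top 10.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-genres").getOrCreate()

series = spark.read.csv("tv_series.csv", header=True, inferSchema=True)

top_genres = (series
              .filter(F.col("episodes") > 10)
              .groupBy("genre")
              .agg(F.sum("num_ratings").alias("total_ratings"))
              .orderBy(F.desc("total_ratings"))
              .limit(10))

top_genres.show()
```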
In this post I am documenting the steps I took to convert a traditional NodeJS app that is launched from the command line using node app.js into a fully dockerized container solution. The app uses a MySQL database, for which it has static configuration. I am not going into too much detail about the app's code or architecture, but it is worth noting that it has this piece of configuration for connecting to the database;
Carrying on with Jeff Patton’s “User Story Mapping” for my project: after slicing releases, it is time to put a framework in place to learn faster.
Once you have your product idea and asked yourself who your customers are, how they will use it, and why they need it, you need to validate that the problem your product will solve really exists. Find a handful of people from your target market and try to engage them.
Carrying on with Jeff Patton’s “User Story Mapping” for my project, following on from Framing the Big Picture, the next step is to plan to build less, because
There’s always more to build than you have people, time and money for
Story mapping helps big groups build shared understanding. If the product has stories that cross multiple teams’ domains, get all the teams together so that you can map a product release across all of them; this will help visualize the dependencies across the teams.
Reading Jeff Patton’s “User Story Mapping”, I have been applying the ideas to a small project I am working on, an online grocery shopping service, gengeni.com. In this post I am focusing on the big picture.
Jeff insists on creating documents that promote a shared understanding through user stories (rather than traditional requirements, which are prone to misinterpretation). He insists that we are building software not for its own sake but to make things better and solve real-world problems; therefore we should focus on maximizing the outcome (how we make things better) while minimizing the output (software components).
NumPy is the core library for scientific computing in Python. It adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Installation: The easiest way to get NumPy installed is to install one of the Python distributions, such as Anaconda, which includes all of the key packages for the scientific computing stack.
Usage: To start using NumPy you need to import it:
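The conventional alias is np; a short example of creating an array and applying a couple of operations:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(a.shape)      # (2, 3)
print(a * 2)        # element-wise multiplication
print(np.mean(a))   # 3.5
```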
Data Science Getting Started Platform: To get started quickly with data science, I started looking at Python and its powerful set of libraries (like pandas, NumPy, Scikit-Learn, etc.) that make data analysis easier. I wanted to have a platform that is accessible over the internet so I can get to it from any laptop/PC that has internet access.
I decided to get a minimal Virtual Private Server (VPS) that supports containers, so I could set up a Docker container with all the languages, frameworks, libraries, and tools, and mount a path on the VPS that contains all the projects I am working on, which will be checked into git.