Articles

Apache Spark

Apache Spark is a fast, general-purpose engine for large-scale data processing. As I wrote in my article on Hadoop, big data sets required a better way of processing because traditional RDBMS simply can’t cope, and Hadoop revolutionized the industry by making it possible to process these data sets on horizontally scalable clusters of commodity hardware. However, Hadoop’s own compute engine, MapReduce, is limited: it offers a single programming model of Mappers and Reducers, and it is tied to reading and writing data to the filesystem between stages, which slows processing down.
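
To make the contrast concrete, here is a minimal sketch of the classic word count in PySpark; the application name and input path are hypothetical placeholders, and it assumes a local Spark installation. Unlike a MapReduce job, the intermediate results stay in memory rather than being written back to the filesystem between steps.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "wordcount" and "input.txt" are
# hypothetical placeholders for this sketch.
spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")                  # RDD of lines
counts = (lines
          .flatMap(lambda line: line.split())     # "map" side: emit words
          .map(lambda word: (word, 1))            # pair each word with 1
          .reduceByKey(lambda a, b: a + b))       # "reduce" side: sum counts

# Intermediate RDDs live in memory; nothing is forced out to the
# filesystem between these stages as it would be in MapReduce.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```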

Hadoop

Traditionally, data analysis has been done on Relational Database Management Systems (RDBMS), which work on data with a clearly defined structure since they require a schema definition before the data can be loaded. RDBMS also scale better vertically than horizontally, meaning scaling is done by moving to higher-capacity machines rather than spreading the load across many machines, as replicating RDBMS data across machines tends to be problematic.
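
As a small illustration of that schema-first constraint, here is a sketch using Python’s built-in sqlite3 module: the table’s structure must be declared before a single row can be loaded, which is exactly the rigidity that schema-on-read systems like Hadoop avoid. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Schema-on-write: the structure must be declared up front...
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, ts TEXT)")

# ...before any data can be loaded; rows that don't fit the
# declared columns are rejected by the engine.
conn.execute("INSERT INTO events VALUES (1, 'login', '2016-01-01T10:00:00')")

for row in conn.execute("SELECT * FROM events"):
    print(row)

conn.close()
```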

Continuous Deployments

Continuous Deployments is the next stage of automation, following on from its predecessors continuous integration (CI) and continuous delivery (CD). The integration phase of a project used to be its most painful step: depending on the size of the project, developers would work in isolated teams dedicated to separate components of the application for a very long time, and when the time came to integrate those components, a host of issues, such as unmet dependencies and interfaces that don’t communicate, were dealt with for the first time. The idea of CI was conceived to combat this problem.
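
At its core, a CI pipeline is just an automated sequence of build and test steps that runs on every integration and fails fast at the first broken step. The sketch below models that idea in Python; the stage commands are hypothetical placeholders, and a real CI server (Jenkins, Travis, etc.) runs an equivalent sequence on every push.

```python
import subprocess
import sys

# Hypothetical pipeline stages, run in order on every integration.
STAGES = [
    ["git", "pull", "--ff-only"],      # fetch the latest mainline
    ["python", "-m", "pytest"],        # run the test suite
    ["python", "setup.py", "sdist"],   # build an artifact
]

for stage in STAGES:
    print("running:", " ".join(stage))
    result = subprocess.run(stage)
    if result.returncode != 0:
        # Fail fast: integration problems surface immediately,
        # not months later at a "big bang" integration.
        sys.exit(f"stage failed: {' '.join(stage)}")

print("pipeline green")
```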

Service Discovery and Proxying

Delivering software as microservices running in immutable, self-sufficient containers is a very robust method and has gained a lot of popularity in recent years. Containers usually expose the microservice as a web service accessible through a certain port number on the host. Because host machines can run many containers, and because these containers need to be started and shut down quickly and easily without any side effects, it is not really feasible for consumers of these web services to point at manually assigned hosts and ports.
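
A service registry addresses this by letting containers announce where they are listening and letting consumers look them up by name rather than by a hard-coded host and port. The sketch below is a toy in-process registry to show the contract; real deployments use networked tools like Consul, etcd, or ZooKeeper, and all the service names and ports here are hypothetical.

```python
# Toy service registry: real systems (Consul, etcd, ZooKeeper)
# provide the same register/lookup contract over the network.
registry: dict[str, list[tuple[str, int]]] = {}

def register(service: str, host: str, port: int) -> None:
    """Called by a container when it starts up."""
    registry.setdefault(service, []).append((host, port))

def deregister(service: str, host: str, port: int) -> None:
    """Called on shutdown (or triggered by a failed health check)."""
    registry[service].remove((host, port))

def lookup(service: str) -> tuple[str, int]:
    """Consumers resolve a name, never a hard-coded host:port."""
    instances = registry[service]
    return instances[0]  # a real proxy would load-balance here

# Hypothetical containers announcing themselves on dynamic ports.
register("users-api", "10.0.0.5", 32768)
register("users-api", "10.0.0.7", 32801)

print(lookup("users-api"))  # ('10.0.0.5', 32768)
```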

Blue/Green Deployments

Traditionally, deploying a new release and making it live in production involves replacing the existing release with the new one. This leads to a period of downtime, which may be considerable for large, monolithic applications. The solutions adopted to tackle this problem are usually some variation of the so-called blue-green deployment process. The diagram below illustrates this setup, in which all public traffic is routed through a reverse proxy (like Nginx or HAProxy) that forwards requests to the correct release of the application (which then interacts with its own instance of the database).
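
Once the idle colour passes a health check, the cutover itself reduces to a single pointer flip. The sketch below models that logic in Python; the ports and health-check URL are hypothetical, and in practice the "pointer" would be an Nginx or HAProxy upstream that gets rewritten and reloaded.

```python
import urllib.request

# Hypothetical back ends: blue is live, green holds the new release.
RELEASES = {"blue": "http://127.0.0.1:8001", "green": "http://127.0.0.1:8002"}
active = "blue"

def healthy(base_url: str) -> bool:
    """Hit a (hypothetical) health endpoint on the idle release."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def cut_over() -> str:
    """Flip traffic to the other colour only if it is healthy."""
    global active
    idle = "green" if active == "blue" else "blue"
    if not healthy(RELEASES[idle]):
        raise RuntimeError(f"{idle} failed its health check; staying on {active}")
    active = idle  # in production: rewrite the proxy config and reload it
    return active

print("now serving:", cut_over())
```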

Containers

Virtual Machines

In the quest for maximising the efficiency of the computing power available on servers, Virtual Machines (VMs) came into existence, with products from firms like VMware and VirtualBox pushing the concept to general users.

“In computing, virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, operating systems, storage devices, and computer resources.” - Wikipedia

Virtual Machines are created on top of hypervisors, which run on top of the host machine’s operating system (OS). Hypervisors allow emulation of hardware such as CPU, disk, memory and network, and server machines can be configured to create a pool of emulated hardware resources available to applications, in the process making the actual hardware resources on those servers utilized much more efficiently.
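
Containers, by contrast, skip the hardware-emulation layer entirely and share the host’s kernel, which is why they start in seconds rather than minutes. As a minimal sketch, assuming Docker and its Python SDK (the docker package) are installed, launching an isolated service looks like this; the image name and port mapping are illustrative.

```python
import docker  # the Docker SDK for Python (pip install docker)

client = docker.from_env()

# Start an isolated nginx service; no hypervisor, no emulated
# hardware: the container shares the host's kernel directly.
container = client.containers.run(
    "nginx:latest",
    detach=True,
    ports={"80/tcp": 8080},  # host port 8080 -> container port 80
)

print(container.short_id, container.status)

# Containers are meant to come and go without side effects.
container.stop()
container.remove()
```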