Getting Started with Apache Spark
Exploring some of the key concepts behind Spark, and what has driven its success in the Big Data realm
Introduction
One of the main challenges that marked the inception of the internet as we know it today was the inability to scale: for example, searching large amounts of data in a short time, or supporting a constantly varying number of users without any downtime.
This was ultimately because, in an online world, new data of varying shapes and sizes is constantly being generated and processed (additional information about Big Data Analysis is available in my previous article).
One of the first approaches taken to solve this problem was Apache Hadoop. Hadoop is an open-source platform designed to break large data jobs into smaller tasks and distribute them across a cluster of computing nodes so that they can be processed in parallel. Hadoop can be broken down into three key components:
- HDFS (Hadoop Distributed File System): a distributed file system designed to store very large datasets with built-in fault tolerance.
- MapReduce: a framework designed to facilitate the…
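To make the MapReduce model concrete, below is a minimal word-count sketch written in plain Python rather than the actual Hadoop API: it mimics the map, shuffle, and reduce phases on a couple of in-memory strings that a real cluster would instead process in parallel across many nodes. The data and function names are illustrative only.

```python
from collections import defaultdict

# Toy input: in a real Hadoop job this would be files stored in HDFS,
# split into blocks and handed to mappers on different nodes.
documents = [
    "spark makes big data simple",
    "hadoop made big data possible",
]

# Map phase: turn each input record into (key, value) pairs.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle phase: group all values by key, as the framework would do
# automatically between the mappers and the reducers.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

# Reduce phase: aggregate the values collected for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)
# {'spark': 1, 'makes': 1, 'big': 2, 'data': 2, 'simple': 1, ...}
```

The key idea is that the map and reduce functions are stateless and operate on independent chunks of data, which is what lets the framework scale them out across a cluster.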