Getting Started with Apache Spark

Pier Paolo Ippolito
Published in TDS Archive
6 min read · Oct 14, 2022

Photo by Ian Schneider on Unsplash

Introduction

One of the main issues that marked the inception of the internet as we know it today was the inability to scale (e.g. searching large amounts of data in a short time, or supporting a constantly varying number of users without any downtime).

This was ultimately because, in an online world, new data of varying shapes and sizes is constantly being generated and processed (additional information about Big Data Analysis is available in my previous article).

One of the first approaches taken to solve this problem was Apache Hadoop. Hadoop is an open-source platform designed to break large data jobs into small chunks and distribute them as tasks across a cluster of computing nodes, so that they can be processed in parallel. Hadoop can be broken down into three key components:

  • HDFS (Hadoop Distributed File System): a distributed file system designed to store very large datasets with fault tolerance.
  • MapReduce: a framework designed to facilitate the parallel processing of the data across the cluster, with a map step that transforms individual records and a reduce step that aggregates the intermediate results (a minimal sketch of this pattern follows the list).
  • YARN (Yet Another Resource Negotiator): the resource manager responsible for scheduling jobs and allocating compute resources across the cluster's nodes.
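
To make the map/reduce pattern concrete, below is a minimal word-count sketch written in PySpark (Spark's Python API). The file name input.txt and the app name are hypothetical placeholders, and the snippet assumes pyspark is installed locally.

```
# A minimal PySpark word count, sketching the same map/reduce pattern
# that Hadoop's MapReduce popularized. "input.txt" is a hypothetical
# local text file; assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                  # read the file as an RDD of lines
      .flatMap(lambda line: line.split())     # map: split each line into words
      .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)        # reduce: sum the counts per word
)

for word, count in counts.take(10):           # print a small sample of results
    print(word, count)

spark.stop()
```

Conceptually, flatMap and map play the role of Hadoop's map phase, while reduceByKey plays the reduce phase; Spark keeps the intermediate data in memory wherever possible, which is a large part of its speed advantage over classic MapReduce.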
