Getting Started with Apache Spark

Pier Paolo Ippolito
Published in TDS Archive
6 min read · Oct 14, 2022

Photo by Ian Schneider on Unsplash

Introduction

One of the main issues that marked the inception of the internet as we know it today was the inability to scale (e.g. searching large amounts of data in a short time, or supporting a constantly varying number of users without any downtime).

This was ultimately because, in an online world, new data of varying shapes and sizes is constantly being generated and processed (additional information about Big Data Analysis is available in my previous article).

One of the first approaches taken to solve this problem was Apache Hadoop. Hadoop is an open-source platform designed to break large data jobs into small chunks and distribute them as tasks across a cluster of computing nodes, so that they can be processed in parallel. Hadoop can be broken down into three key components:

  • HDFS (Hadoop Distributed File System): a distributed file system designed to store very large datasets with fault tolerance.
  • MapReduce: a framework designed to facilitate the parallel processing of the data across the cluster, with a map step that transforms individual records and a reduce step that aggregates the intermediate results (a minimal sketch of this pattern follows the list).
  • YARN (Yet Another Resource Negotiator): the resource manager responsible for scheduling jobs and allocating compute resources across the cluster's nodes.
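
To make the map/reduce pattern concrete, below is a minimal word-count sketch written in PySpark (Spark's Python API). The file name input.txt and the app name are hypothetical placeholders, and the snippet assumes pyspark is installed locally.

```
# A minimal PySpark word count, sketching the same map/reduce pattern
# that Hadoop's MapReduce popularized. "input.txt" is a hypothetical
# local text file; assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                  # read the file as an RDD of lines
      .flatMap(lambda line: line.split())     # map: split each line into words
      .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)        # reduce: sum the counts per word
)

for word, count in counts.take(10):           # print a small sample of results
    print(word, count)

spark.stop()
```

Conceptually, flatMap and map play the role of Hadoop's map phase, while reduceByKey plays the reduce phase; Spark keeps the intermediate data in memory wherever possible, which is a large part of its speed advantage over classic MapReduce.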
