An end-to-end guide on how to reduce a dataset's dimensionality using Feature Extraction techniques such as PCA, ICA, LDA, LLE, t-SNE and AE.



It is nowadays becoming quite common to work with datasets of hundreds (or even thousands) of features. If the number of features becomes similar to (or even greater than!) the number of observations stored in a dataset, then a Machine Learning model trained on it will most likely suffer from overfitting. In order to avoid this type of problem, it is necessary to apply either regularization or dimensionality reduction techniques (Feature Extraction). In Machine Learning, the dimensionality of a dataset is equal to the number of variables used to represent it.

Using Regularization could certainly help reduce the risk…
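As a first taste of Feature Extraction, the sketch below applies PCA from scikit-learn to a synthetic dataset whose 10 observed features are generated from just 3 latent factors (all the numbers here are made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 observations, 10 features driven by 3 latent factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Project the data onto its 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # reduced representation
print(pca.explained_variance_ratio_.sum())   # variance retained by 3 components
```

Because the data truly lives on a 3-dimensional subspace (plus a little noise), the three components retain almost all of the variance; on real datasets, `explained_variance_ratio_` is the usual guide for choosing how many components to keep.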

A look into some of the main paradoxes associated with Data Science and its statistical foundations

Photo by Shadan Arab on Unsplash


Paradoxes are a class of phenomena which arise when, although starting from premises known to be true, we derive a logically unreasonable result. As Machine Learning models create knowledge from data, this makes them susceptible to possible cognitive paradoxes between training and testing.

In this article, I will walk you through some of the main paradoxes associated with Data Science and how they can be identified:

  • Simpson’s Paradox
  • Accuracy Paradox
  • Learnability-Gödel Paradox
  • The Law of Unintended Consequences

Simpson’s Paradox

One of the most common forms of paradox in Data Science is Simpson’s Paradox.

As an example, let us consider a thought…
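As a minimal numerical illustration, the snippet below uses the classic kidney-stone treatment figures to show how a trend can hold in every subgroup and yet reverse once the subgroups are pooled:

```python
import pandas as pd

# Classic kidney-stone data: treatment A wins within each subgroup,
# yet treatment B wins once the subgroups are pooled together.
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes": [81, 192, 234, 55],
    "patients": [87, 263, 270, 80],
})
df["rate"] = df["successes"] / df["patients"]

# Success rate per (stone size, treatment) pair
per_group = df.pivot(index="stone_size", columns="treatment", values="rate")
print(per_group)  # A beats B for both small and large stones...

# Success rate after pooling the subgroups
overall = df.groupby("treatment")[["successes", "patients"]].sum()
overall["rate"] = overall["successes"] / overall["patients"]
print(overall["rate"])  # ...but B beats A overall: the paradox
```

The reversal happens because the subgroup sizes are unbalanced: treatment A was mostly applied to the harder (large-stone) cases, which drags its pooled rate down.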

Introduction to some of the most common techniques which can be used in order to query information from data for interpretable inference.

Photo by Riccardo Pelati on Unsplash


Two of the main techniques used in order to try to discover causal relationships are Graphical Methods (such as Knowledge Graphs and Bayesian Belief Networks) and Explainable AI. These two methods in fact form the basis of the Association level in the Causality Hierarchy (Figure 1), enabling us to answer questions such as: which different properties compose an entity, and how are the different components related to each other?

In case you are interested in finding out more about how Causality is used in Machine Learning, more information is available in my previous article: Causal Reasoning in Machine Learning.
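At its simplest, a knowledge graph of the kind mentioned above can be sketched as a set of (subject, relation, object) triples; the entities and relations below are invented purely for illustration:

```python
from collections import defaultdict

# Minimal knowledge graph expressed as (subject, relation, object) triples
triples = [
    ("Aspirin", "treats", "Headache"),
    ("Aspirin", "may_cause", "Stomach irritation"),
    ("Dehydration", "causes", "Headache"),
]

# Index the triples by subject so we can query an entity's relations
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

# Query: which entities is "Aspirin" related to, and how?
for rel, obj in graph["Aspirin"]:
    print(f"Aspirin --{rel}--> {obj}")
```

Real systems would typically use a dedicated graph library or database rather than plain dictionaries, but the association-level queries look much the same.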

An investigation through some of the main limitations Artificial Intelligence-powered systems are facing

Photo by Dan Meyers on Unsplash


Thanks to recent advancements in Artificial Intelligence (AI), we are now able to leverage Machine Learning and Deep Learning technologies in both academic and commercial applications. However, relying solely on correlations between the different features can lead to wrong conclusions, since correlation does not necessarily imply causation. Two of the main limitations of today's Machine Learning and Deep Learning models are:

  • Robustness: trained models might not be able to generalise to new data and therefore would not be able to provide robust and reliable performance in the real world.
  • Explainability: complex Deep Learning models can be difficult to analyse…

Model Interpretability

A guide on how to prevent bias in Machine Learning models and understand their decisions.

Photo by Nicolas Jossi on Unsplash

“Although neural networks might be said to write their own programs, they do so towards goals set by humans, using data collected for human purposes. If the data is skewed, even by accident, the computers will amplify injustice.”

— The Guardian [1]


The application of Machine Learning in domains such as medicine, finance and education is still nowadays quite complicated, due to the ethical concerns surrounding the use of algorithms as automatic decision-making tools.

Two of the main causes at the root of this mistrust are: bias and low explainability. In this article, we are going to explore both of these…

A practical introduction on how to create online Machine Learning models able to train on one data point at a time while processing streaming data

Photo by O12 on Unsplash

Online Learning

Online Learning is a subset of Machine Learning which emphasizes the fact that data generated from environments can change over time.

In fact, traditional Machine Learning models are instead considered to be static: once a model is trained on a set of data, its parameters do not change any more. However, an environment and the data it generates might change over time, therefore making our pre-trained model unreliable.

One simple solution which is commonly used by companies in order to solve these problems is to retrain and deploy an updated version of the Machine Learning model automatically once the performance…
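An alternative to full retraining is to update the model incrementally. The streaming setup described above can be sketched with scikit-learn's `partial_fit` API, which updates a model one observation at a time (the data stream below is synthetic, generated from a simple linear concept):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()  # linear model trained by stochastic gradient descent
classes = np.array([0, 1])  # all classes must be declared up front

# Simulate a data stream: each "batch" is a single labelled data point
for _ in range(500):
    x = rng.normal(size=(1, 2))
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])  # true concept: x1 + x2 > 0
    model.partial_fit(x, y, classes=classes)     # incremental update

# After 500 one-point updates the model has learned the decision boundary
print(model.predict([[2.0, 2.0]]), model.predict([[-2.0, -2.0]]))
```

If the environment drifts, the same loop simply keeps running on the new stream, so the model tracks the change instead of going stale.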

A practical introduction to creating and optimizing Reinforcement Learning agents using the OpenAI Gym and Ray Python libraries



OpenAI Gym is an open-source interface for typical Reinforcement Learning (RL) tasks. Using OpenAI Gym, it can therefore become really easy to get started with Reinforcement Learning, since we are already provided with a wide variety of different environments and agents. All that is left for us to do is to code up different algorithms and test them in order to make our agent learn how to accomplish different tasks in the best possible way (as we will see later, this can easily be done using Ray). …
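Every Gym environment exposes the same small interface: `reset()` returns an initial observation, and `step(action)` returns an observation, a reward, a done flag and an info dictionary. As a minimal sketch that does not assume Gym or Ray are installed, the hand-rolled environment below mimics that classic interface; `CorridorEnv` and its reward values are invented for illustration:

```python
class CorridorEnv:
    """Toy environment mimicking the classic Gym interface: the agent starts
    at position 0 and must reach position 10 by moving left (0) or right (1)."""

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = self.pos >= 10
        reward = 1.0 if done else -0.01  # small cost per step, bonus at the goal
        return self.pos, reward, done, {}  # observation, reward, done, info

# The standard agent-environment loop, identical to one written against Gym
env = CorridorEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(1)  # trivial policy: always move right
    total_reward += reward
print(obs, round(total_reward, 2))
```

Swapping `CorridorEnv` for, say, `gym.make("CartPole-v1")` leaves the loop structure unchanged, which is exactly what makes it easy to plug in different algorithms.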

Demystifying some of the main concepts and terminology associated with Reinforcement Learning, and their connection to other fields of AI

Reinforcement Learning Application Example
Photo by Lenin Estrada on Unsplash


Today, Artificial Intelligence (AI) has undergone impressive advancements. AI can be subdivided into three different levels according to the ability of machines to perform intellectual tasks logically and independently:

  • Narrow AI: machines are more efficient than humans at performing very specific tasks (but are unable to generalise to other types of tasks).
  • General AI: machines are as intelligent as human beings.
  • Strong AI: machines perform better than humans across many different domains (including tasks that we might not be able to perform at all).

Right now, thanks to Machine Learning, we have been able to achieve good competency at the Narrow…

Using Reveal.js and D3.js in order to create interactive online data science presentations

Photo by Campaign Creators on Unsplash


Being able to summarise data science projects and show their potential business benefits can play a really important role in securing new customers, making it easier for non-technical audiences to understand some key design concepts.

In this article, I am going to introduce you to two free programming frameworks which can be used in order to create interactive online presentations and data-based storytelling reports.


Reveal.js is an open-source presentation framework completely built on open web technologies. Using Reveal.js, it is possible to easily create web-based presentations and to export them in other formats such as PDF.

Some of…

Hands-on Tutorials

Using Data Science and Machine Learning even when there is no data available.

Photo by Kristopher Allison on Unsplash


One of the main limitations of the current state of Machine Learning and Deep Learning is the constant need for new data. However, how can it then be possible to make estimates and predictions for situations in which we don't have any data available? This can in fact be more common than we would normally think.

As an example, let’s consider a thought experiment: we are working as a Data Scientist for a local authority and we are asked to find a way in order to optimise how the evacuation plan should work in case of a natural disaster (e.g…
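In scenarios like this, one common workaround is to simulate the data we do not have. The Monte Carlo sketch below estimates how long an evacuation might take under purely assumed distributions; every parameter here is hypothetical and would in practice come from domain experts:

```python
import random

random.seed(1)

def simulate_evacuation(n_people=1000):
    """One simulated evacuation: per-person clearance times drawn from an
    assumed Normal(15, 4) distribution in minutes (illustrative values only)."""
    times = [random.gauss(15.0, 4.0) for _ in range(n_people)]
    return max(times)  # minutes until the last person is clear

# Monte Carlo estimate: repeat the simulation and average the outcomes
runs = [simulate_evacuation() for _ in range(200)]
avg_clearance_time = sum(runs) / len(runs)
print(f"Estimated time to clear the area: {avg_clearance_time:.1f} minutes")
```

The value of such a model lies less in the point estimate than in how it responds when we vary the assumptions, which is precisely what we can do before any real data exists.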

Pier Paolo Ippolito

Data Scientist at SAS, TDS Associate Editor and Freelancer.
