An end-to-end guide on how to reduce a dataset's dimensionality using Feature Extraction techniques such as PCA, ICA, LDA, LLE, t-SNE and AE.

Introduction

It is nowadays becoming quite common to work with datasets of hundreds (or even thousands) of features. If the number of features becomes similar to (or even bigger than!) the number of observations stored in a dataset, then this can most likely lead to a Machine Learning model suffering from overfitting. In order to avoid this type of problem, it is necessary to apply either regularization or dimensionality reduction techniques (Feature Extraction). In Machine Learning, the dimensionality of a dataset is equal to the number of variables used to represent it.

Using Regularization could certainly help reduce the risk…
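To make this more concrete, below is a minimal sketch of Feature Extraction with PCA using scikit-learn. The Iris dataset and the two-component projection are illustrative assumptions on my part, not necessarily the setup used in the full article.

    # A minimal PCA sketch: project a 4-feature dataset down to 2 components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # PCA is sensitive to feature scales, so standardise the data first.
    X_scaled = StandardScaler().fit_transform(X)

    # Keep the 2 directions of maximum variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                # (150, 2)
    print(pca.explained_variance_ratio_)  # variance captured by each component

The same fit/transform pattern applies to several of the other techniques mentioned above (e.g. FastICA or LocallyLinearEmbedding in scikit-learn), which makes it easy to swap them in and compare results.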


An investigation into some of the main limitations that Artificial Intelligence-powered systems are facing

Introduction

Thanks to recent advancements in Artificial Intelligence (AI), we are now able to leverage Machine Learning and Deep Learning technologies in both academic and commercial applications. However, relying just on correlations between different features can possibly lead to wrong conclusions, since correlation does not necessarily imply causation. Two of the main limitations of today's Machine Learning and Deep Learning models are:

  • Robustness: trained models might not be able to generalise to new data and therefore would not be able to provide robust and reliable performance in the real world.
  • Explainability: complex Deep Learning models can be difficult to analyse…


Model Interpretability

A guide on how to prevent bias in Machine Learning models and understand their decisions.

“Although neural networks might be said to write their own programs, they do so towards goals set by humans, using data collected for human purposes. If the data is skewed, even by accident, the computers will amplify injustice.”

— The Guardian [1]

Introduction

Applying Machine Learning in fields such as medicine, finance and education is still quite complicated nowadays, due to the ethical concerns surrounding the use of algorithms as automatic decision-making tools.

Two of the main causes at the root of this mistrust are bias and low explainability. In this article, we are going to explore both of these…
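As a small, hedged taste of the explainability side, the sketch below uses permutation importance, one common model-agnostic technique (an illustrative choice on my part, not necessarily the method explored in the article), on a toy scikit-learn dataset.

    # Permutation importance: shuffle each feature and measure the score drop.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Large score drops mark the features the model's decisions rely on most.
    result = permutation_importance(model, X_test, y_test,
                                    n_repeats=10, random_state=0)
    print(result.importances_mean.argsort()[::-1][:5])  # top-5 feature indices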


A practical introduction on how to create online Machine Learning models able to train on one data point at a time while processing streaming data

Online Learning

Online Learning is a subset of Machine Learning which emphasizes the fact that data generated from environments can change over time.

Traditional Machine Learning models, in fact, are considered to be static: once a model is trained on a set of data, its parameters don't change any more. However, an environment and the data it generates might change over time, therefore making our pre-trained model unreliable.

One simple solution commonly used by companies in order to solve these problems is to automatically retrain and deploy an updated version of the Machine Learning model once the performance…
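For a feel of what training on one data point at a time looks like in practice, here is a minimal sketch using scikit-learn's SGDClassifier and its partial_fit method; the simulated stream and the toy labelling rule are assumptions for illustration only.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(42)
    classes = np.array([0, 1])
    model = SGDClassifier()

    # Simulate a data stream: the model is updated one observation at a time.
    for step in range(1000):
        x = rng.normal(size=(1, 2))
        y = np.array([int(x[0, 0] + x[0, 1] > 0)])  # toy labelling rule
        # All possible classes must be declared on the first partial_fit call.
        model.partial_fit(x, y, classes=classes)

    print(model.predict([[1.0, 1.0]]))  # expected: [1]

Because each update only sees the latest observation, the model can keep adapting as the environment (and therefore the data distribution) drifts over time.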


A practical introduction to creating and optimizing Reinforcement Learning agents using the OpenAI Gym and Ray Python libraries

Introduction

OpenAI Gym is an open-source interface for typical Reinforcement Learning (RL) tasks. Using OpenAI Gym, it can therefore become really easy to get started with Reinforcement Learning, since we are already provided with a wide variety of different environments and agents. All we are left to do is to code up different algorithms and test them in order to make our agent learn how to accomplish different tasks in the best possible way (as we will see later, this can easily be done using Ray). …
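Before diving in, here is a minimal sketch of the classic Gym interaction loop, using the CartPole-v1 environment and a purely random policy for illustration (this follows the pre-0.26 Gym API; newer releases return extra values from reset and step).

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()

    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()         # random policy
        obs, reward, done, info = env.step(action)
        total_reward += reward

    print(f"Episode finished with total reward {total_reward}")
    env.close()

Replacing env.action_space.sample() with a learned policy is exactly where libraries such as Ray (RLlib) come in.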


Demystifying some of the main concepts and terminology associated with Reinforcement Learning, and how they relate to other fields of AI

Reinforcement Learning Application Example

Introduction

In recent years, Artificial Intelligence (AI) has undergone impressive advancements. AI can be subdivided into three different levels, according to the ability of machines to perform intellectual tasks logically and independently:

  • Narrow AI: machines are more efficient than humans in performing very specific tasks (without attempting to perform other types of tasks).
  • General AI: machines are as intelligent as human beings.
  • Strong AI: machines perform better than humans in many different fields (in tasks that we might or might not be able to perform at all).

Right now, thanks to Machine Learning, we have been able to achieve good competency at the Narrow…


Using Reveal.js and D3.js in order to create interactive online data science presentations

Introduction

Being able to summarise data science projects and show their potential business benefits can play a really important role in securing new customers and in making it easier for non-technical audiences to understand some key design concepts.

In this article, I am going to introduce you to two free programming frameworks which can be used in order to create interactive online presentations and data-based storytelling reports.

Reveal.js

Reveal.js is an open-source presentation framework built entirely on open web technologies. Using Reveal.js, it is possible to easily create web-based presentations and to export them to other formats such as PDF.

Some of…


Hands-on Tutorials

Using Data Science and Machine Learning even when there is no data available.

Introduction

One of the main limitations of the current state of Machine Learning and Deep Learning is the constant need for new data. How can it then be possible to make estimates and predictions for situations in which we don't have any data available? This can in fact be more common than we would normally think.

As an example, let’s consider a thought experiment: we are working as a Data Scientist for a local authority and we are asked to find a way in order to optimise how the evacuation plan should work in case of a natural disaster (e.g…
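Since there is no data to learn from in a scenario like this, one way forward is to simulate it. The sketch below is a toy Monte Carlo model of evacuation times; every distribution and parameter in it is a made-up assumption for illustration, not a figure from the article.

    import random

    def simulate_evacuation(n_people: int) -> float:
        """Return a simulated evacuation time (in minutes) for one scenario."""
        reaction = max(0.0, random.gauss(5.0, 2.0))   # assumed reaction time
        travel = max(0.0, random.gauss(20.0, 5.0))    # assumed travel time
        congestion = 0.01 * n_people                  # assumed crowding penalty
        return reaction + travel + congestion

    random.seed(0)
    runs = [simulate_evacuation(n_people=1000) for _ in range(10_000)]
    print(f"Mean evacuation time: {sum(runs) / len(runs):.1f} minutes")

Averaging over many simulated runs gives the kind of estimate we could then refine as soon as real data becomes available.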


OPINION

Introduction to some of the most promising programming languages for Data Science and Cloud Development

Introduction

As of 2020, there are about 700 programming languages available [1]. Some of these tend to be applied just in specific domains, while others are widely appreciated for their ability to work across a wide range of applications. During the past decade, there has been almost steady growth in the use of software, and new languages have been developed in order to meet the demand. In this article, we are going to explore some of the most used programming languages today, as well as potential new stars, in the fields of Data Science and Cloud Development.

Deciding to…


A guide through the foundations of cloud technology and its applications in Data Science

Introduction

Nowadays, more and more companies are moving towards developing and deploying applications in cloud-based environments. One of the main motivations for cloud computing is that it gets rid of all the problems associated with setting up and managing the required hardware and software. This is accomplished by remotely renting computer resources available in data centres maintained by a cloud provider.

In this way, companies and individuals can remotely make use of the hardware and software setups and workflows provided by different cloud providers, without having to worry about buying the equipment, setting up different environments and maintaining them over time…

Pier Paolo Ippolito

Data Scientist at SAS, TDS Associate Editor and Freelancer. https://pierpaolo28.github.io/
