Visualize Missing Data with missingno

When preparing a dataset for data analysis or machine learning, we often find missing values. Missing values are generally represented as NaN, which stands for Not a Number. They can be a real problem because many machine learning algorithms can’t handle missing data: every row that contains even a single missing value has to be deleted, or the missing value has to be replaced (imputed) with a new one.

 

What is missingno?

Missingno is a Python library that uses graphics to help you understand how missing values are distributed in a dataset. The heatmap and the bar plot are two of the visualizations it provides. With this library, you can see where missing data occurs and check the correlation between the columns containing missing values and the target column. Once the dataset has been investigated this way, the missing data can be handled more appropriately. Let’s put this into practice and see how it improves our data pre-processing.

Missingno Implementation

We use Google Colab in this tutorial. You can also use Jupyter Notebook or any similar tool for a quick demonstration.

The missingno installation page is available here

Installation

First, let’s install missingno with the pip command.
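In a Colab or Jupyter cell, the install is typically a single line: `!pip install missingno`. As a plain-Python sketch, the check below reports whether the package can already be imported. The helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("missingno"):
    print("missingno is already installed")
else:
    # In a notebook cell, run: !pip install missingno
    print("missingno not found; install it with: pip install missingno")
```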

Preparing the Dataset and Importing missingno

As our data is stored in Google Drive, let’s mount our Drive in Google Colab.
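A minimal sketch of the mount step, assuming the conventional `/content/drive` mount point (the original does not state which mount point it used). The `try`/`except` guard is an addition so the same cell also runs outside Colab, where `google.colab` is not available.

```python
try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import drive
except ImportError:
    drive = None

if drive is not None:
    # Asks for authorization, then exposes Drive under /content/drive.
    drive.mount("/content/drive")
    mounted = True
else:
    mounted = False
    print("Not running in Colab; skipping the Drive mount.")
```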

Then, import the missingno library and load the dataset.
Simply replace /path/to/dataset/example.csv with the actual path of your dataset in Google Drive.
Let’s take a quick look at our dataset.

Viewing missing data with Pandas

Before using missingno, let’s see how we can identify missing data without missingno.
Here are a few pandas features that give us an initial insight into how much data is missing.
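The original cells are not shown, so the following is only a sketch of the kind of pandas calls that are commonly used for this. The small DataFrame is hypothetical stand-in data, not the tutorial’s dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a few deliberate gaps.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [25, np.nan, 31, np.nan, 40, 22],
    "income": [50000, np.nan, 62000, np.nan, 58000, np.nan],
    "city": ["Tokyo", "Osaka", None, "Nagoya", "Kyoto", "Tokyo"],
})

# Number of missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Share of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = df.isnull().mean()
print(missing_share.round(2))

# Non-null counts and dtypes in one summary.
df.info()
```

These numbers tell us how much is missing per column, but not where the gaps sit or whether they occur together, which is what the plots in the next section add.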

Using missingno

Now let’s see how missingno visualizes the distribution of missing data.

The missingno library offers four types of plots for visualizing data completeness: Bar Plot, Matrix Plot, Heatmap, and Dendrogram.

Bar Plot

Each bar shows the completeness of one column. A bar that reaches 1 means the column has no missing values; the shorter the bar, the more values are missing.


Matrix Plot

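The matrix plot shows where the missing values are located, row by row, in each column: filled cells are present values and white gaps are missing ones.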


Heatmap

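The heatmap shows the nullity correlation between each pair of columns, that is, how strongly the presence or absence of a value in one column is associated with the presence or absence of a value in another.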


Dendrogram

The dendrogram is a tree-like graph generated through hierarchical clustering. It groups together columns whose nullity is strongly correlated.


 

Ultimately, missingno helps us understand the missing data in our dataset before we start a data analysis or machine learning workflow, by showing how much data is missing, where it is missing, and how the missing data in one column relates to the data in other columns.

Esar Suwandi
Software engineer based in Tokyo focused on Web Development. 4x AWS Certified: CLF, SAA, SAP, DOP. Let's build something awesome!
