Data Cleaning: What, Why, How?

We have all been there. You import your dataset to your favourite analysis tool, you choose your model, and as you try to fit it to the data, your software throws up an error - your dataset contains nonsense!

Missing data is just one of the problems that most data scientists encounter every day. And nonsense data being recognised as such is, in fact, the best-case scenario. Missing data can be encoded in many different ways (I have seen missing data labelled as -1, -999, ‘missing’, ‘went for a coffee’ and many others in my time as a data scientist), not all of which will be identified as nonsense by your analysis tool. This is why a thorough exploration of the data should supersede any analysis.

Data exploration

Here are some key steps you can apply before starting any analysis:

Get some summary statistics for each variable. Find the minimum, the maximum., the mean, the median, etc. That should let you discover if any measurements are out of the expected range, and hence identify possible mislabelled data. Replace mislabelled data with the NaN operator of your chosen data analysis tool.
Have a look at your data. Literally. Plot every variable against every other variable using e.g. a scatterplot matrix. That will further allow you to identify outliers, obvious correlations and colinearities and may give you some idea of the level of noise in your data.

Missing data

There can be missing data in your dataset for many reasons. Maybe a sensor was dysfunctional for a while. Maybe the measurements were taken manually, and someone forgot or didn’t have time to do some of the measurements. Maybe you are working with survey data, and a respondent did not want to answer one or more of the survey questions. Nearly every dataset I have ever handled has had missing data for one or the other reason.

The first course of action when handling missing data is to get them labelled as such. As mentioned above, there are many different approaches to labelling missing data, some of which are numerical and would not necessarily be detected before fitting the model and hence biasing the results with non-physical measurements. This can possibly be solved by thorough data exploration, as described above, but should ideally be handled in a more rigorous fashion, by knowing your data. When you receive a new dataset from a third party, ask for a companion codebook, where the labels used are explained, so that you have some prior knowledge of what to expect from the dataset.

Imputation

If the goal of your analysis is prediction, rather than inference/explanatory analysis, it may be desirable to perform imputation on the data to fill in any missing values and hence preserve the volume of the data, rather than simply throwing missing data away. If, on the other hand, you are doing an explanatory analysis using some manner of (generalised) linear model, doing imputation can be dangerous and may lead to biasing of your resulting regression coefficients.

There are many different approaches to data imputation, which essentially boils down to filling in your missing observations with some (educated) guess of what that value might have been if it had been measured. Such approaches include:

Filling in missing values with the mean/median of the respective value.
Filling in missing values with random numbers distributed similarly to the existing values.
Using some more sophisticated data imputation method, such as using another model to generate predictions or the missing values. Sensible choices would be to use k-Nearest Neighbours to infer the missing values or a bespoke imputation algorithm such as MICE (Multiple Imputation by Chained Equations).

Noisy data

Data can have various types of noise, including sensor noise, environmental noise, mislabelled data and noise resulting from approximations and transformations. Different types of noise require different denoising techniques. On a high level, they boil down to finding some patterns in the data that would hopefully reveal any problematic data points.

Correlated data

When the data is correlated, typically if the measurements are either spatially or temporally coupled, it is relatively simple to separate the signal from the noise. This can be achieved through some form of truncated projection, either with a truncated Singular Value Decomposition, truncated Fourier transform or some other means of separating lower frequency oscillations (typically signal) from higher frequency oscillations (noise).

Uncorrelated data

When the data is noisy, and there are no obvious higher-order correlations that can help detect such issues, separating signal from noise can be very tricky. The two most obvious solutions, in my opinion, are (1) to use some anomaly detection algorithm to flag potentially noisy and mislabelled data points and then manually correct or drop suspicious data points; and (2) to bake the uncertainty in the data into the model. This can be achieved by using, e.g. a Gaussian Process Regressor/Classifier, which will simultaneously estimate both the model and the noise present in the data.

Approximation and transformation error

The data you are handling might not be “vanilla” in the sense that it may have been post-processed, either programmatically or by a human being, before landing on your desk. This can result in errors reflective of the transformations applied to the data. There is no obvious cure for this problem, and it again highlights the importance of companion codebooks with each dataset, that rigorously explain how each of the variables has been constructed. Ask for a codebook when you get a new dataset.

Online data cleaning

If you receive data continuously from e.g. a sensor, you may want to clean the data online, while receiving the data stream.

Missing data

If the sensor is down, it would typically send some error code, which should then be labelled as missing data. You may want to make predictions for that sensor when it is down (since it could feed into some process flow control). In this case, it is important to separate predictions from actual reading, and label predictions as such in your log, so that if the data is processed further down the line, it is clear that it is not real. Such predictions would typically be the output of some model, and, in this respect, may be wrong. If sensor measurements are routinely replaced with model predictions, it would be sensible to use some model that provides a measure of its uncertainty along with the predictions, such as a Gaussian Process Regression.

Noisy data

If your incoming data are noisy, you may want to filter them before using them to make decisions. This can be achieved by filtering the incoming data stream with, e.g. a Kalman Filter, Particle Filter or some other means of denoising the data continuously.

About digiLab

digiLab is an agile, deep tech company operating as a trusted partner to help industry leaders make actionable decisions through digital twin solutions. We invent first-of-a-kind, world-leading machine learning solutions which empower senior decision-makers to exploit the value locked in their data and models.

Data Cleaning: What, Why, How?

Data exploration

Missing data

Imputation

Noisy data

Correlated data

Uncorrelated data

Approximation and transformation error

Online data cleaning

Missing data

Noisy data

About digiLab

Featured Posts