How to preprocess a dataset with many types of missing data

Question

I'm working on the beginner machine learning project Big Mart Sales. Its data set contains missing values (NaN) as well as inconsistent values that need to be normalized (lf -> Low Fat, reg -> Regular, etc.).

My current approach to preprocessing this data is to create an imputer for every type of value that needs to be fixed:

import numpy as np
from sklearn.impute import SimpleImputer as Imputer

# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])

# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])

# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])

However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?

In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.

Here are some rows of my data:

Could you add some rows of your dataframe to this post? And do you have only numerical data? — PV8, Oct 10 '19 at 07:21
Yes, I have edited the post and added a screenshot of my data. — Eric Cartman, Oct 10 '19 at 20:58

score 2 · Accepted Answer · answered Jul 08 '20 at 13:42

There is a Python package, ctrl4ai, that can do this for you in a simple way:

pip install ctrl4ai

from ctrl4ai import preprocessing

preprocessing.impute_nulls(dataset)

Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is huge
Returns: Dataframe [with separate column for each categorical values]
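
For readers who would rather not add a dependency, the central-tendency idea can be approximated with plain pandas. This is only a minimal sketch and not ctrl4ai's implementation: the helper name `impute_central_tendency` is hypothetical, the sample values and the `Outlet_Size` column are made up for illustration, and it simply uses the mean for numeric columns and the most frequent value for everything else.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_central_tendency(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs with the mean (numeric) or mode (other)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Made-up sample; column names are assumptions for illustration
sample = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": ["Medium", None, "Medium"],
})
filled = impute_central_tendency(sample)
```

The helper returns a copy, so the original frame is left untouched and the result can be compared against it.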

Good work! It will reduce LOC! — Parvathirajan Natarajan, Jun 10 '21 at 13:09

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?

If you have a numerical column, there are several ways to fill the missing data:

A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
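
The first three options in this list can be sketched with pandas as follows. The values and the seed are made up for illustration, and the model-based option is left out because it depends on which estimator is chosen.

```python
import numpy as np
import pandas as pd

# Made-up numeric column with two missing entries
weights = pd.Series([9.3, np.nan, 17.5, 9.3, np.nan], name="Item_Weight")

# 1. Constant that is distinct from every observed value
filled_constant = weights.fillna(0)

# 2. Value copied from another randomly selected record
rng = np.random.default_rng(seed=0)  # seeded only for reproducibility
observed = weights.dropna().to_numpy()
missing = weights.isna()
filled_random = weights.copy()
filled_random[missing] = rng.choice(observed, size=int(missing.sum()))

# 3. Mean, median or mode of the column
filled_mean = weights.fillna(weights.mean())
filled_median = weights.fillna(weights.median())
filled_mode = weights.fillna(weights.mode().iloc[0])
```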

Let's see how it works with the mean for one column. One method is to use fillna from pandas:

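# assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
X['Name'] = X['Name'].fillna(X['Name'].mean())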
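
Note that `fillna` is a pandas method, so it is only available when `X` is a DataFrame. If `X` is already a NumPy array, as the `X[:, 0:1]` slicing in the question suggests, one option is to wrap it in a DataFrame first. A minimal sketch, with made-up values and assumed column names; it also shows that `X[:, 0:1]` selects only column 0, as a 2D array with a single column:

```python
import numpy as np
import pandas as pd

# Made-up object array: one numeric column and one categorical column
X = np.array(
    [[9.3, "Low Fat"], [np.nan, "Regular"], [17.5, "Low Fat"]],
    dtype=object,
)

# The stop index of a slice is exclusive, so this is column 0 only, kept 2D
single_column = X[:, 0:1]

# Wrap the array so pandas methods such as fillna become available
df = pd.DataFrame(X, columns=["Item_Weight", "Item_Fat_Content"])
df["Item_Weight"] = pd.to_numeric(df["Item_Weight"])
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Convert back if the rest of the pipeline expects an ndarray
X_filled = df.to_numpy()
```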

For categorical data please have a look here: Impute categorical missing values in scikit-learn
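
For the inconsistent labels mentioned in the question (`LF`, `low fat`, `reg`), pandas `Series.replace` with a mapping may be simpler than one imputer per spelling, and the most frequent value can then fill any remaining gaps. A minimal sketch; the column name `Item_Fat_Content` and the sample values are assumptions:

```python
import numpy as np
import pandas as pd

# Made-up categorical column with inconsistent spellings and one gap
fat = pd.Series(
    ["Low Fat", "LF", "low fat", "reg", "Regular", np.nan, "Low Fat"],
    name="Item_Fat_Content",
)

# Map every variant spelling to its canonical label in one pass
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(label_map)

# Fill the remaining missing entries with the most frequent label
most_frequent = fat_clean.mode().iloc[0]
fat_clean = fat_clean.fillna(most_frequent)
```

Doing the replacement before computing the mode matters here: otherwise the variant spellings would split the counts for the same category.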

answered Oct 10 '19 at 07:27 by PV8

I tried what you suggested and got the error: AttributeError: 'numpy.ndarray' object has no attribute 'fillna' – Eric Cartman Oct 10 '19 at 20:41
I thought it was a DataFrame. Does it not work for `X['Item_Weight']`? – PV8 Oct 11 '19 at 05:49
