I'm working through the beginner machine learning project Big Mart Sales. Its data set contains missing values (NaN) in several columns, as well as inconsistent labels that need to be normalized (lf -> Low Fat, reg -> Regular, etc.).

My current approach to preprocessing this data is to create an imputer for every kind of value that needs to be fixed:

import numpy as np
from sklearn.impute import SimpleImputer as Imputer

# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])

# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])

# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])

However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?

In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.

Here are some rows of my data: [image not included]

2 Answers

2

There is a Python package, ctrl4ai, that can do this for you in a simple way:

pip install ctrl4ai

from ctrl4ai import preprocessing

preprocessing.impute_nulls(dataset)

Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is huge
Returns: Dataframe [with separate column for each categorical values]
0

However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?

If you have a numerical column, you can use some approaches to fill the missing data:

  • A constant value that has meaning within the domain, such as 0, distinct from all other values.
  • A value from another randomly selected record.
  • A mean, median or mode value for the column.
  • A value estimated by another predictive model.
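
The first three options above can be sketched with pandas as follows. This is only an illustration on a made-up column: the name Item_Weight and its values are assumptions, not taken from the actual data set, and the model-based option is left out because it depends on which predictor you choose.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two missing entries (values are made up).
weights = pd.Series([9.3, np.nan, 17.5, 19.2, np.nan, 8.9], name="Item_Weight")

# 1. A constant that is distinct from every observed value.
filled_constant = weights.fillna(0)

# 2. A value copied from a randomly selected non-missing record.
rng = np.random.default_rng(seed=0)
observed = weights.dropna().to_numpy()
missing = weights.isna()
filled_random = weights.copy()
filled_random.loc[missing] = rng.choice(observed, size=int(missing.sum()))

# 3. A central-tendency value computed from the observed entries.
filled_mean = weights.fillna(weights.mean())
filled_median = weights.fillna(weights.median())
```

Each variant returns a new Series, so the original column stays available for comparing the strategies.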

Let's see how this works with the mean of a single column. One method is to use fillna from pandas (this assumes X is a pandas DataFrame and 'Name' is a numeric column):

X['Name'] = X['Name'].fillna(X['Name'].mean())
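
Applied to the question as a whole, the same idea might look like the sketch below. It assumes the data is loaded into a pandas DataFrame; the column names (Item_Weight, Item_Fat_Content, Outlet_Size) and the sample rows are assumptions made for illustration, so adjust them to the real file.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Item_Fat_Content": ["Low Fat", "LF", "reg", "low fat"],
    "Outlet_Size": ["Medium", np.nan, "Medium", "Small"],
})

# Make the labels consistent with one mapping instead of one imputer per label.
fat_labels = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(fat_labels)

# NaN in a categorical column: fill with the most frequent value.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# NaN in a numerical column: fill with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())
```

Because each step works on a single named column, there is no need to slice out a 2D array first.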

For categorical data please have a look here: Impute categorical missing values in scikit-learn
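
If you would rather stay within scikit-learn, one possible arrangement is to let ColumnTransformer route each column to its own SimpleImputer. This is a minimal sketch, not the only way to do it; the column names and rows are again assumed for illustration. Passing the columns as a list is what gives each imputer the 2D input it expects, so no neighbouring column has to be included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows; the real data set has more columns.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Outlet_Size": pd.Series(["Medium", np.nan, "Medium", "Small"], dtype=object),
})

# One imputer per column type; a list of names selects a 2D (n_rows, 1) block.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Item_Weight"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["Outlet_Size"]),
])

# Output columns follow the order of the transformers listed above.
imputed = pd.DataFrame(
    preprocess.fit_transform(df),
    columns=["Item_Weight", "Outlet_Size"],
)
```

Fitting the transformer on the training data and reusing it on the test data keeps the fill values consistent between the two.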

Community
PV8