4

Is there a way to impute categorical values using a sklearn.preprocessing object? Ultimately, I would like to create a preprocessing object that I can apply to new data so that it is transformed the same way as the old data.

I am looking for a way to do it so that I can use it this way.

user1367204

4 Answers

2

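Yes, it is possible. For example, you can use sklearn.preprocessing.Imputer with the parameter strategy='most_frequent'. (In scikit-learn 0.20 and later, the equivalent class is sklearn.impute.SimpleImputer; Imputer was removed in 0.22.)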

Use the fit_transform method on the old data (training set), then use transform on the new data (test set).
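
As a minimal sketch of that fit-then-transform pattern, assuming a recent scikit-learn where the imputer is sklearn.impute.SimpleImputer: the tiny arrays X_train and X_test below are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical column with missing entries.
X_train = np.array([["a"], ["b"], ["a"], [np.nan]], dtype=object)
X_test = np.array([[np.nan], ["b"]], dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# Learn the most frequent value from the old data and fill it in.
X_train_filled = imputer.fit_transform(X_train)

# Reuse the value learned from the old data on the new data.
X_test_filled = imputer.transform(X_test)
```

The fitted value is stored in imputer.statistics_, so the new data is filled with the mode of the old data, not with its own mode.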

slonopotam
  • Well, I assumed it's already numeric (integer) categorical data. If the categorical data is in string format, it first needs to be translated to numbers with, for example, sklearn.preprocessing.LabelEncoder – slonopotam Mar 17 '17 at 11:46
  • When I use LabelEncoder I lose the numpy.NaN fields: they get turned into a number, and then I can't use the Imputer in the next step. – user1367204 Mar 17 '17 at 11:59
  • @user1367204, you can still use this number with Imputer; just pass it as the missing_values parameter. There is probably a cleaner solution, but this one works too... – slonopotam Mar 17 '17 at 12:24
2

Imputers from sklearn.preprocessing work well for numerical variables. For categorical variables, however, the categories are mostly strings, not numbers. To use those imputers, you need to convert the strings to numbers, impute, and finally convert back to strings.
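
The convert, impute, convert-back route can be sketched roughly as follows. This is only an illustration under assumptions: it uses pandas.factorize for the string-to-integer step and sklearn.impute.SimpleImputer (the successor of Imputer) for the imputation, and the series s is made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical string column with missing values.
s = pd.Series(["red", "blue", np.nan, "red", np.nan], name="color")

# 1. Strings -> integer codes. pandas.factorize marks missing values as -1.
codes, uniques = pd.factorize(s)

# 2. Impute the -1 placeholder with the most frequent code.
imputer = SimpleImputer(missing_values=-1, strategy="most_frequent")
filled_codes = imputer.fit_transform(codes.reshape(-1, 1)).ravel().astype(int)

# 3. Integer codes -> strings, using the mapping returned by factorize.
restored = pd.Series(np.asarray(uniques)[filled_codes], index=s.index, name=s.name)
```

Note that this sketch handles a single series only; to treat new data the same way, the mapping in uniques would have to be stored and reused rather than recomputed.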

A better option is to use CategoricalImputer() from the sklearn_pandas package.

It replaces null-like values with the mode and works with string columns. The sklearn-pandas package can be installed with pip install sklearn-pandas and imported with import sklearn_pandas.

Dr Nisha Arora
0

By copying and modifying this answer, I made an imputer for a pandas.Series object:

import numpy
import pandas 

from sklearn.base import TransformerMixin


class SeriesImputer(TransformerMixin):
    """Impute missing values in a pandas.Series.

    If the Series is of dtype object, impute with the most frequent value.
    If the Series is not of dtype object, impute with the mean.
    """

    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            # value_counts() sorts by frequency and skips NaN,
            # so the first index entry is the most frequent value
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        # Fill missing entries with the value learned in fit()
        return X.fillna(self.fill)

To use it you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])  # numpy.NaN was removed in NumPy 2.0


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series 
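
Since the goal in the question is to treat new data the same way as old data, here is a small self-contained sketch of reusing a fitted SeriesImputer on a second series. The class is repeated so the example runs on its own, and the series named old and new are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin


class SeriesImputer(TransformerMixin):
    """Same idea as above: mode for object dtype, mean otherwise."""

    def fit(self, X, y=None):
        if X.dtype == np.dtype("O"):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)


old = pd.Series(["k", "i", "t", "t", "e", np.nan])  # hypothetical old data
new = pd.Series(["x", np.nan, np.nan])              # hypothetical new data

imputer = SeriesImputer().fit(old)   # learn the fill value from the old data
new_filled = imputer.transform(new)  # apply the same fill value to the new data
```

The missing entries in the new series are filled with the most frequent value of the old series, not of the new one.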
user1367204
0

You may also use OrdinalEncoder.

You can train it on the training set with the fit() method to obtain a fitted model, then apply that model to both the training set and the test set with transform():

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# train the model on a training set of type pandas.DataFrame, for example
oe.fit(df_train)
# transform the training set using the model:
ar_train_encoded = oe.transform(df_train)
# transform the test set using the SAME model:
ar_test_encoded = oe.transform(df_test)

The result is a numpy array.
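
OrdinalEncoder by itself encodes categories rather than filling in missing ones, so one possible way to connect it to the original question is to chain it after an imputer. The following is a sketch under assumptions, not part of the answer above: it uses sklearn.impute.SimpleImputer and sklearn.pipeline.make_pipeline, and the frames df_train and df_test are made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test sets with one string column.
df_train = pd.DataFrame({"color": ["red", "blue", "red", np.nan]})
df_test = pd.DataFrame({"color": [np.nan, "blue"]})

# Impute the most frequent category first, then encode categories as numbers.
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(),
)

# Fit on the training set only, then apply the SAME fitted pipeline to both sets.
ar_train_encoded = pipe.fit_transform(df_train)
ar_test_encoded = pipe.transform(df_test)
```

Both the fill value and the category-to-number mapping are learned from the training set, so the test set is transformed consistently with it.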

Catalina Chircu