11

I'm currently exploring scikit-learn pipelines, and I would like to do the preprocessing inside a pipeline as well. However, my train and test data have different levels of a categorical variable. Consider:

import pandas as pd
train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

I wrote a transformer class based on TransformerMixin using pandas:

from sklearn.base import TransformerMixin

class CreateDummies(TransformerMixin):

    def transform(self, X, **transformparams):
        return pd.get_dummies(X).copy()

    def fit(self, X, y=None, **fitparams):
        return self

fit_transform yields 2 columns for the train data and 4 columns for the test data. No surprise there, but it makes the transformer unusable in a pipeline, because the columns produced for train and test do not match.
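For illustration, a quick check of the resulting columns (a sketch, reusing the train and test Series defined above):

CreateDummies().fit_transform(train).columns  # -> ['a', 'b']
CreateDummies().fit_transform(test).columns   # -> ['a', 'b', 'c', 'd']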

Similarly, I tried the LabelEncoder (and OneHotEncoder for the potential next steps):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
le.fit_transform(train)
le.transform(test)

which, not surprisingly, raises an error.
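For reference, a minimal way to see the failure (on my scikit-learn version, the labels unseen during fit raise a ValueError):

try:
    le.transform(test)
except ValueError as err:
    # 'c' and 'd' never appeared in the train data, so the encoder rejects them
    print(err)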

So the problem here is that I need information that is only contained in the test set (the full list of category levels). Is there a good way to include this in a pipeline?

Quickbeam2k1
  • Can you get_dummies before you split train and test? – piRSquared Oct 01 '16 at 09:12
  • The data I have are from a Kaggle competition, already split into train and test. Of course I could do this by simply concatenating those sets (the test set also has NaNs in different columns than the train set), but I fear that I would then need a pre-preprocessing step, and I'm not yet sure I like this ;) – Quickbeam2k1 Oct 01 '16 at 09:24

1 Answer

10

You can use categoricals as explained in this answer:

import numpy as np

categories = np.union1d(train, test)
train = train.astype('category', categories=categories)
test = test.astype('category', categories=categories)

pd.get_dummies(train)
Out: 
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  1  0  0
3  1  0  0  0
4  1  0  0  0

pd.get_dummies(test)
Out: 
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
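A possible way to fold this into a pipeline (a sketch, not from the original answer): compute the union of levels once and hand it to the transformer. pd.CategoricalDtype is the newer pandas spelling of the categories= keyword used above, which later pandas versions removed from astype.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CreateDummies(BaseEstimator, TransformerMixin):
    """Dummy-encode a Series against a fixed, pre-computed list of levels."""

    def __init__(self, categories):
        self.categories = categories

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # cast to a categorical with the fixed levels so that train and
        # test always produce the same set of dummy columns
        dtype = pd.CategoricalDtype(categories=self.categories)
        return pd.get_dummies(X.astype(dtype))

categories = np.union1d(train, test)   # compute the full level set once
enc = CreateDummies(categories)
enc.fit_transform(train)               # columns a, b, c, d
enc.transform(test)                    # columns a, b, c, d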
ayhan
  • Hey, thanks for the great answer. For a DataFrame with more than one column, one should use ```train.apply(lambda x: x.astype('category', categories=categories), axis=0)```. At first I was worried about the union1d function, since it collects the levels across all columns, but this is not an issue. Do you see a way to avoid the apply here? – Quickbeam2k1 Oct 01 '16 at 12:37
  • `astype('category')` only works for one dimensional arrays right now so either `apply` or an explicit for loop seems necessary. – ayhan Oct 01 '16 at 12:48
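Following up on that comment, a rough sketch of the explicit loop, written per column and using the newer CategoricalDtype API (the column names and data here are made up):

import numpy as np
import pandas as pd

# hypothetical two-column example
train_df = pd.DataFrame({'x': list('abbaa'), 'y': list('uvvuu')})
test_df = pd.DataFrame({'x': list('abcd'), 'y': list('uvwz')})

for col in train_df.columns:
    # union of levels per column, so columns do not pick up each other's levels
    dtype = pd.CategoricalDtype(np.union1d(train_df[col], test_df[col]))
    train_df[col] = train_df[col].astype(dtype)
    test_df[col] = test_df[col].astype(dtype)

pd.get_dummies(train_df)  # same dummy columns as pd.get_dummies(test_df)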