Inconsistent LabelBinarizer Behaviour breaks Pipeline

Question

My pipeline looks like this:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})

lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)

Which generates:

array([[0],
       [1],
       [1]])

However, when I apply my pipeline on unseen data:

test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
lb.transform(test_animals.animal)

It will spit out:

array([[1, 0],
       [1, 0],
       [0, 0],
       [0, 0]])

Which breaks everything, because the output goes from one column to two.

I need LabelBinarizer to ALWAYS one-hot encode and never generate a single column. So:

lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)

Will ideally generate:

array([[1, 0],
       [0, 1],
       [0, 1]])

score 1 · Accepted Answer · answered Feb 22 '18 at 14:53

I think I've come up with a solution that works around this with the label_binarize function and that works with DataFrameMapper:

import pandas as pd
import numpy as np
from sklearn.preprocessing import label_binarize, LabelBinarizer
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper

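class SafeLabelBinarizer(TransformerMixin):
    """LabelBinarizer that always returns one column per class seen in fit."""

    def __init__(self):
        self.lb = LabelBinarizer()

    def fit(self, X, y=None):
        X = np.array(X)
        self.lb.fit(X)
        self.classes_ = self.lb.classes_
        return self

    def transform(self, X):
        # Pad the classes with a dummy label so that a two-class fit is
        # treated as multiclass and is not collapsed to a single column.
        K = np.append(self.classes_, ['__FAKE__'])
        # classes must be passed by keyword in recent scikit-learn versions.
        X = label_binarize(X, classes=K, pos_label=1, neg_label=0)
        # Drop the column that belongs to the dummy label.
        X = np.delete(X, np.s_[-1], axis=1)
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)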

Training data:

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})

mapper = DataFrameMapper([
    ('animal', SafeLabelBinarizer())], df_out=True)

mapper.fit_transform(train_animals)

>>>

   animal_cat  animal_dog
0           1           0
1           0           1
2           0           1
Unseen data:

test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
mapper.transform(test_animals)

>>>

   animal_cat  animal_dog
0           1           0
1           1           0
2           0           0
3           0           0
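
The same padding trick can be checked without sklearn_pandas. This is only a minimal sketch: safe_binarize is a hypothetical helper name, not part of scikit-learn, and '__FAKE__' is assumed never to occur as a real label.

```python
import numpy as np
from sklearn.preprocessing import label_binarize


def safe_binarize(values, classes):
    """One column per known class; unseen labels become all-zero rows."""
    # Append a dummy class so a two-class problem is handled as multiclass.
    padded = np.append(classes, ['__FAKE__'])
    encoded = label_binarize(values, classes=padded)
    # Remove the dummy class column, which is always the last one.
    return np.delete(encoded, -1, axis=1)


known = np.array(['cat', 'dog'])
train_out = safe_binarize(['cat', 'dog', 'dog'], known)
test_out = safe_binarize(['cat', 'cat', 'duck', 'fish'], known)
print(train_out)
print(test_out)
```

Both calls return two columns, in the order given by known.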

Vivek Kumar · Answer 2 · 2018-02-22T07:49:06.270

It's documented in the LabelBinarizer docs that binary data will only produce 1 column:

Returns: Y : array or CSR matrix of shape [n_samples, n_classes]. Shape will be [n_samples, 1] for binary problems.
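
To see the documented shapes side by side, here is a small sketch. The variable names are just for illustration, and the third label 'duck' is added only to make the second case multiclass.

```python
from sklearn.preprocessing import LabelBinarizer

# Two distinct labels: treated as a binary problem, so one column.
binary_shape = LabelBinarizer().fit_transform(['cat', 'dog', 'dog']).shape

# Three distinct labels: one column per class.
multi_shape = LabelBinarizer().fit_transform(['cat', 'dog', 'duck']).shape

print(binary_shape, multi_shape)
```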

If you need one column per category, you can try the following methods:

1) pd.get_dummies()

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
pd.get_dummies(train_animals).values

array([[1, 0],
       [0, 1],
       [0, 1]])

But the caveat of this approach is that you need to transform the data before splitting it into train and test, not just the train data, because on the test data alone it will generate a different number of columns.
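
If the data is already split, one possible workaround (a sketch, not the only way) is to remember the training columns and reindex the test dummies against them. The names train_cols and test_dummies are illustrative, and dtype=int is passed because newer pandas versions return booleans by default.

```python
import pandas as pd

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})

train_dummies = pd.get_dummies(train_animals, dtype=int)
train_cols = train_dummies.columns

# Keep only the columns seen in training; categories missing from the
# test data are added as all-zero columns, unseen ones are dropped.
test_dummies = (
    pd.get_dummies(test_animals, dtype=int)
    .reindex(columns=train_cols, fill_value=0)
)
print(test_dummies)
```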

2) CategoricalEncoder()

from sklearn.preprocessing import CategoricalEncoder
enc = CategoricalEncoder()
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
enc.fit_transform(train_animals[['animal']])

array([[1, 0],
       [0, 1],
       [0, 1]])

Note that CategoricalEncoder is still only in the development branch, so it may not be easy to use.

3) Instead of CategoricalEncoder, you can use the combination of LabelEncoder and OneHotEncoder. See my other answer for more details on usage:

https://stackoverflow.com/a/48079345/3374996

But for points 2 and 3, you need to make sure that all the possible values in the 'animal' column are present in train. If the test set contains unseen values, it will throw an error, because the ML model can't do anything with test data that it hasn't seen.
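
For readers on a released scikit-learn (0.20 or later): the CategoricalEncoder functionality ended up in OneHotEncoder, which accepts string columns directly. The sketch below assumes that version and uses its handle_unknown='ignore' option, which encodes unseen categories as all-zero rows instead of raising. Variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})

# The encoder returns a sparse matrix by default, hence toarray().
enc = OneHotEncoder(handle_unknown='ignore')
train_encoded = enc.fit_transform(train_animals[['animal']]).toarray()
test_encoded = enc.transform(test_animals[['animal']]).toarray()

print(train_encoded)
print(test_encoded)
```

The number of columns is fixed at fit time, so 'duck' and 'fish' do not add columns at transform time.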

Just a note: as of today [CategoricalEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html) is only available in the dev version of sklearn, but will be coming with the new release. — Marcus V., Feb 22 '18 at 07:37
@MarcusV. Yes. I was intending to edit the answer to add more details. Thanks — Vivek Kumar, Feb 22 '18 at 07:49
