My pipeline looks like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)
This generates:
array([[0],
       [1],
       [1]])
However, when I apply my pipeline on unseen data:
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
lb.transform(test_animals)
It outputs:
array([[1, 0],
       [1, 0],
       [0, 0],
       [0, 0]])
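This breaks everything downstream, because the number of columns no longer matches the output produced on the training data.
I need LabelBinarizer to ALWAYS one-hot encode and never generate a single column, even when the training data contains only two classes. So: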
lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)
would ideally generate:
array([[1, 0],
       [0, 1],
       [0, 1]])
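To make the desired behaviour concrete, here is a minimal sketch that produces it with a different class, OneHotEncoder, rather than LabelBinarizer. This is a swapped-in technique and only an assumption about what would be acceptable in the pipeline. The names enc, train_encoded and test_encoded are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})

# Assumption: OneHotEncoder stands in for LabelBinarizer here.
# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
enc = OneHotEncoder(handle_unknown='ignore')

# OneHotEncoder expects 2-D input, hence the double brackets.
# The default output is a sparse matrix, so convert it to a dense array.
train_encoded = enc.fit_transform(train_animals[['animal']]).toarray()
test_encoded = enc.transform(test_animals[['animal']]).toarray()

print(train_encoded)  # one column per class seen during fit (cat, dog)
print(test_encoded)   # same two columns; 'duck' and 'fish' become zero rows
```

With this approach the column count is fixed by the classes seen during fit, so training and unseen data always have the same width. The values come back as floats rather than integers.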