I trained a classifier using Scikit-Learn. I load the input for training my classifier from a CSV. The values of some of my columns (e.g. 'Town') are categorical (e.g. they can be 'New York', 'Paris', 'Stockholm', ...). In order to use those categorical columns, I am doing one-hot encoding with the LabelBinarizer from Scikit-Learn.
This is how I transform data before training:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Column names for the raw CSV (list truncated here).
headers = [
    'Ref.', 'Town'  # ,...
]
df = pd.read_csv("/path/to/some.csv", header=None, names=headers, na_values="?")

# One-hot encode the 'Town' column.
lb = LabelBinarizer()
lb_results = lb.fit_transform(df['Town'])
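For what it's worth, this is how I inspect what the fitted binarizer learned (the towns shown in the comments are just the ones from my example):

# The fitted binarizer remembers the distinct towns (sorted),
# and lb_results has one row per sample and one column per town.
print(lb.classes_)        # e.g. ['New York' 'Paris' 'Stockholm' ...]
print(lb_results.shape)   # (n_samples, n_towns)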
It is, however, not clear to me how to use the LabelBinarizer to create feature vectors from new input data on which I want to make predictions. In particular, if the new data contains a town seen during training (e.g. New York), it needs to be encoded in the same column position as that town was in the training data.
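To make the problem concrete, here is the naive approach I want to avoid; "/path/to/new.csv" is just a placeholder for the file I would predict on:

new_df = pd.read_csv("/path/to/new.csv", header=None, names=headers, na_values="?")

# Refitting a fresh LabelBinarizer learns its classes from the new file only,
# so the resulting columns need not line up with those used during training.
wrong_results = LabelBinarizer().fit_transform(new_df['Town'])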
How is the LabelBinarizer supposed to be re-applied to new input data?
(I am not attached to Scikit-Learn for this; if someone knows how to do it with Pandas' get_dummies method, that is fine too.)
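For completeness, this is roughly how I would use get_dummies on the training data; as far as I understand, it has the same issue, because the dummy columns depend on which towns happen to appear in the frame:

dummies = pd.get_dummies(df['Town'], prefix='Town')
# Produces columns such as Town_New York, Town_Paris, Town_Stockholm.
# A new frame containing only some of these towns (or unseen ones)
# would yield a different set of columns.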