
I trained a classifier using Scikit-Learn, loading the training input from a CSV. The values of some of my columns (e.g. 'Town') are categorical (e.g. 'New York', 'Paris', 'Stockholm', ...). To use those categorical columns, I am doing one-hot encoding with the LabelBinarizer from Scikit-Learn.

This is how I transform data before training:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

headers = [ 
    'Ref.', 'Town' #,...
]

df = pd.read_csv("/path/to/some.csv", header=None, names=headers, na_values="?")

lb = LabelBinarizer()
lb_results = lb.fit_transform(df['Town'])

However, it is not clear to me how to use the LabelBinarizer to create feature vectors from new input data for which I want to make predictions. In particular, if the new data contains a town seen during training (e.g. New York), it needs to be encoded in the same position as that town in the training data.

How is the Label Binarization supposed to be re-applied on new input data?

(I don't have a strong preference for Scikit-Learn; if someone knows how to do it with Pandas' get_dummies method, that is fine too.)

charlesreid1
Pierre-Antoine

1 Answer


Just call lb.transform() on the already fitted lb model.

Demo:

Assuming we have the following train DF:

In [250]: df
Out[250]:
           Town
0      New York
1        Munich
2          Kiev
3         Paris
4        Berlin
5      New York
6  Zaporizhzhia

Fit (train) & transform (binarize) in one step:

In [251]: r1 = pd.DataFrame(lb.fit_transform(df['Town']), columns=lb.classes_)

Yields:

In [252]: r1
Out[252]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       0         1      0             0
1       0     0       1         0      0             0
2       0     1       0         0      0             0
3       0     0       0         0      1             0
4       1     0       0         0      0             0
5       0     0       0         1      0             0
6       0     0       0         0      0             1

lb is now trained on the towns that were present in df.

Now we can binarize new data sets with the trained lb model, using lb.transform(). Towns that lb has not seen are encoded as all-zero rows:

In [253]: new
Out[253]:
       Town
0    Munich
1  New York
2     Dubai  # <--- new (not trained) town

In [254]: r2 = pd.DataFrame(lb.transform(new['Town']), columns=lb.classes_)

In [255]: r2
Out[255]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       1         0      0             0
1       0     0       0         1      0             0
2       0     0       0         0      0             0
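
The demo above can also be written as a self-contained script. This is only a minimal sketch that rebuilds the two data frames by hand (the names `train`, `new`, `encoded_train`, `encoded_new` and `unseen_mask` are chosen here for illustration); it separates `fit()` from `transform()` and flags the rows whose town was not present during fitting:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Training data (the same towns as in the demo above).
train = pd.DataFrame({"Town": ["New York", "Munich", "Kiev", "Paris",
                               "Berlin", "New York", "Zaporizhzhia"]})
# New data to predict on; "Dubai" was never seen during fitting.
new = pd.DataFrame({"Town": ["Munich", "New York", "Dubai"]})

lb = LabelBinarizer()
lb.fit(train["Town"])  # learn the town -> column mapping once

# transform() reuses the fitted mapping, so the column order matches training.
encoded_train = pd.DataFrame(lb.transform(train["Town"]), columns=lb.classes_)
encoded_new = pd.DataFrame(lb.transform(new["Town"]), columns=lb.classes_)

# A row of all zeros means the town was not among the fitted classes.
unseen_mask = encoded_new.sum(axis=1) == 0
print(encoded_new)
print("Unseen towns:", new.loc[unseen_mask, "Town"].tolist())
```

To reuse the encoder in a separate prediction script, the fitted `lb` object has to be kept around (for example by persisting it together with the classifier), because refitting on the new data would produce a different set of columns.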
MaxU - stand with Ukraine
  • Thanks a lot. This is exactly what I was looking for. Would you also know a way to do the one-hot encoding on several columns at the same time? Or do we need to do several one-hot encodings and concatenate the resulting matrices/dataframes together? – Pierre-Antoine Oct 10 '17 at 20:10
  • @Pierre, you are welcome! You can use [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) – MaxU - stand with Ukraine Oct 10 '17 at 20:14
  • Oh I see. I thought MultiLabelBinarizer was something different. Thanks, that is very helpful! Last but not least, there is no way to apply the one-hot encoding to some columns of the dataframe without losing the other ones (and having to concatenate them back), right? – Pierre-Antoine Oct 10 '17 at 20:22
  • @Pierre, I'm not sure that I understood you correctly... I'd suggest opening a new question with a small reproducible data set and your desired data set. This will help to clearly understand what you are after. – MaxU - stand with Ukraine Oct 10 '17 at 20:25
  • Good point. I opened a new question here: https://stackoverflow.com/questions/46675870/doing-one-hot-encoding-in-several-columns-of-a-pandas-data-frames – Pierre-Antoine Oct 10 '17 at 20:56
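
For the follow-up about encoding several columns while keeping the remaining ones, one possible approach is sketched below. Note that it uses scikit-learn's `OneHotEncoder` inside a `ColumnTransformer` rather than the MultiLabelBinarizer mentioned in the comments, and it assumes a scikit-learn version that provides `sklearn.compose`. The `Country` and `Price` columns and all values are invented for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Country" and "Price" are invented for this sketch.
train = pd.DataFrame({
    "Town": ["New York", "Munich", "Paris"],
    "Country": ["US", "DE", "FR"],
    "Price": [10.0, 20.0, 30.0],
})
new = pd.DataFrame({
    "Town": ["Munich", "Dubai"],   # "Dubai" is unseen
    "Country": ["DE", "AE"],       # "AE" is unseen
    "Price": [15.0, 25.0],
})

ct = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Town", "Country"]),
    ],
    remainder="passthrough",  # keep the non-encoded columns as they are
    sparse_threshold=0,       # always return a dense array
)

X_train = ct.fit_transform(train)  # fit the encoder on training data only
X_new = ct.transform(new)          # reuse the fitted column layout

print(X_train.shape, X_new.shape)
print(X_new)
```

The encoded columns come first and the passed-through columns are appended after them, so training and prediction data end up with the same column layout without any manual concatenation.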