Say I have the following data:
import pandas as pd
data = {
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)
I would like to one-hot encode the two columns "Brand" and "Town" in order to train a classifier (say with Scikit-Learn) and predict the year.
Once the classifier is trained, I will want to predict the year for new incoming data (not used in training), where I will need to re-apply the same one-hot encoding. For example:
new_data = {
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich']
}
In this context, what is the best way to one-hot encode these two columns of the Pandas DataFrame, given that several columns need to be encoded and that the same encoding must be applicable to new data later?
This is a follow-up to the question "How to re-use LabelBinarizer for input prediction in SkLearn".
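To make the difficulty concrete, here is a minimal sketch of what goes wrong when the new data is encoded on its own, along with one possible workaround. It assumes `pd.get_dummies` is used for the encoding; the names `cat_cols`, `train_dummies`, `naive_new` and `aligned_new` are only illustrative, and re-aligning with `reindex` is just one option, not necessarily the best one.

```python
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})
new_df = pd.DataFrame({
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})

cat_cols = ['Brand', 'Town']  # columns to one-hot encode (illustrative name)

# Encoding learned from the training data: one column per category seen there.
train_dummies = pd.get_dummies(df[cat_cols], dtype=int)

# Encoding the new data independently only creates columns for the
# categories present in the new data, so the layout no longer matches.
naive_new = pd.get_dummies(new_df[cat_cols], dtype=int)

# One possible workaround (an assumption, not a recommendation): force the
# new data onto the training columns, filling missing categories with 0.
aligned_new = naive_new.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(naive_new.columns))
print(aligned_new)
```

Note that in this sketch a category that appears only in the new data would be silently dropped by `reindex`, which is part of why I am asking for a more robust approach.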