I have a training set that I'm using to build some machine learning models and I need to set up some code to predict on a test set (that I don't have access to).
For instance, if I have a DataFrame, train:
car
0 Audi
1 BMW
2 Mazda
I can use pd.get_dummies to get:
car_Audi car_BMW car_Mazda
0 1 0 0
1 0 1 0
2 0 0 1
Call the resulting DataFrame train_encoded.
Now, suppose my test DataFrame looks like:
car
0 Mercedes
I can use:
pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)
to get:
car_Audi car_BMW car_Mazda
0 0 0 0
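For reference, the setup so far can be reproduced with the short sketch below. It assumes a reasonably recent pandas. The dtype=int argument is my addition so the dummies print as 0/1 rather than booleans, and fill_value=0 makes the columns that are missing from the test dummies come back as 0 instead of NaN.

```python
import pandas as pd

# Toy data mirroring the example: three known cars in train, one unseen car in test.
train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
test = pd.DataFrame({"car": ["Mercedes"]})

# One-hot encode the training set; dtype=int gives 0/1 instead of booleans.
train_encoded = pd.get_dummies(train, dtype=int)

# Encode the test set, then align it to the training columns.
# Columns missing from the test dummies are filled with 0, and
# columns not seen in training (car_Mercedes) are dropped.
test_encoded = pd.get_dummies(test, dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(test_encoded)
```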
How can I treat NaNs the same as an unseen value for my car column? That is, if I encounter NaN in my car column in test, I want to get back:
car_Audi car_BMW car_Mazda
0 0 0 0
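To make the target behaviour concrete, here is a rough sketch of one way it might be done. It is not necessarily the best approach. It assumes the set of known cars is fixed from train, and the names known_cars, car_dtype and encode are hypothetical helpers I made up for illustration. The idea is to mask anything not seen in training to NaN, cast to a categorical with the training categories, and rely on pd.get_dummies leaving NaN rows as all zeros (its default, dummy_na=False).

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
test = pd.DataFrame({"car": ["Mercedes", np.nan, "BMW"]})

# Known categories come from the training data only (hypothetical names).
known_cars = sorted(train["car"].dropna().unique())
car_dtype = pd.CategoricalDtype(categories=known_cars)

def encode(df):
    """One-hot encode df['car'] against the training categories.

    Unseen values are first masked to NaN, so unseen values and NaN
    are treated identically: both end up as a row of all zeros.
    """
    cars = df["car"].where(df["car"].isin(known_cars))  # unseen -> NaN
    cars = cars.astype(car_dtype)
    # A categorical Series yields one column per category, even if unobserved.
    return pd.get_dummies(cars, prefix="car", dtype=int)

train_encoded = encode(train)
test_encoded = encode(test)

print(test_encoded)
```

With this sketch, the Mercedes row and the NaN row should both come back as all zeros, while the BMW row gets a 1 in car_BMW.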
Thanks!