How to one hot encode with pandas on a new dataset?

Question

I have a training data set with categorical features, which I one-hot encode using pd.get_dummies. This produces a data set with n features, and I then train a classification model on it. If I now get new data with the same categorical features and one-hot encode it again, the resulting number of features is m < n.

I cannot predict the classes of the new data set if its dimensions don't match those of the original training data.

Is there a way to include all of the original n features in the new data set after one hot encoding?

EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.

Have a look at this: https://stackoverflow.com/a/45365714/4683950 - Espoir Murhabazi, Mar 08 '18 at 19:27

score 2 · Accepted Answer · answered Mar 08 '18 at 19:27

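For example, suppose your encoded training DataFrame tradf has the columns ['A_1', 'A_2'].

If your new df has the column ['A'] but it contains only the category 1, get_dummies will produce only A_1. You can reindex the result against the training columns and fill the missing ones with 0:

pd.get_dummies(df).reindex(columns=tradf.columns,fill_value=0)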
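
To make the reindex call above concrete, here is a minimal sketch. The sample values and the names train, new_encoded and aligned are made up for illustration; tradf and df follow the naming used in the answer. The categories are stored as strings so that get_dummies encodes the column by default.

```python
import pandas as pd

# Hypothetical training data: column 'A' takes the categories '1' and '2'.
train = pd.DataFrame({"A": ["1", "2", "1"]})
tradf = pd.get_dummies(train)        # columns: A_1, A_2

# Hypothetical new data: only category '1' appears.
df = pd.DataFrame({"A": ["1", "1"]})
new_encoded = pd.get_dummies(df)     # columns: A_1 only

# Align the new data to the training columns; columns that are
# missing from the new data (here A_2) are created and filled with 0.
aligned = new_encoded.reindex(columns=tradf.columns, fill_value=0)

print(list(aligned.columns))
```

Note that reindex also drops any column that is not in tradf.columns, so a category that appears only in the new data is silently discarded rather than raising an error.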

BENY
