I have a training data set with categorical features, which I one-hot encode using pd.get_dummies. This produces a data set with n features, and I train a classification model on it. If I then get new data with the same categorical features and one-hot encode it in the same way, the resulting number of features is m < n.

I cannot predict the classes of the new data set if the dimensions don't match with the original training data.

Is there a way to include all of the original n features in the new data set after one hot encoding?

EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.

PyRsquared

1 Answer

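For example, suppose your training DataFrame tradf has the columns ['A_1', 'A_2'].

If your new DataFrame df has the column 'A' but contains only category 1, get_dummies produces only 'A_1'. You can reindex the result against the training columns, filling the missing ones with 0: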

pd.get_dummies(df).reindex(columns=tradf.columns,fill_value=0)
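
The one-liner above can be tried end to end with a minimal sketch like the following. The toy frames train_raw and new_raw and their values are assumptions made up for illustration; only get_dummies and reindex come from the answer. Category 3 is added to the new data to show that reindex also drops dummy columns the model never saw during training.

```python
import pandas as pd

# Hypothetical training data: column "A" takes the values 1 and 2.
train_raw = pd.DataFrame({"A": [1, 2, 1]})
tradf = pd.get_dummies(train_raw, columns=["A"])  # columns: A_1, A_2

# Hypothetical new data: category 2 is missing, category 3 is unseen.
new_raw = pd.DataFrame({"A": [1, 3]})
new_encoded = pd.get_dummies(new_raw, columns=["A"])  # columns: A_1, A_3

# Align to the training layout: add missing columns as 0, drop unseen ones.
aligned = new_encoded.reindex(columns=tradf.columns, fill_value=0)

print(list(aligned.columns))
```

After this step, aligned has the same columns in the same order as tradf, so it can be passed to the predict method of a model fitted on tradf.
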
BENY