I have a training data set that has categorical features on which I use pd.get_dummies
to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.