
I tried to apply the pandas get_dummies function to my dataset. The problem is that the number of category values does not match between the train set and the valid set. For example, a column in the train set has 5 distinct values, e.g. [1, 2, 3, 4, 5], but the same column in the valid set has only 3 distinct values, e.g. [1, 3, 5].

When I built the model on the train dataset, 5 dummies were created, e.g. dum_1, dum_2, dum_3, dum_4, dum_5.

So if I apply the same function to the valid dataset, only 3 dummies will be created, e.g. dum_1, dum_3, dum_5.

As a result, my model cannot predict on the valid dataset. How can I create the same dummies for the train and valid sets? (I cannot concatenate the two datasets, so please suggest a method other than pd.concat.)

Also, if I simply add new columns to the valid set, I expect the results to be wrong, because the order of the dummy columns will not match between the train and valid sets.

Thanks.

Jihoon Seo
  • Does this answer your question? [Keep same dummy variable in training and testing data](https://stackoverflow.com/questions/41335718/keep-same-dummy-variable-in-training-and-testing-data) – Ben Reiniger Jul 12 '21 at 13:06

1 Answer


All you need to do is:

  1. Create the columns in the validation dataset that are present in the training data but missing from the validation data.
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0
  2. These columns are appended at the end, so the column order no longer matches the training data. Rearrange the columns to follow the training order:
valid = valid[train.columns]
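
As a possible alternative, both steps can be done with a single `DataFrame.reindex` call. This is only a minimal sketch: the toy frames `train_raw` and `valid_raw` and the column name `dum` are assumptions modelled on the example in the question, not the asker's real data.

```python
import pandas as pd

# Hypothetical toy data mirroring the question:
# the train column has 5 distinct values, the valid column only 3.
train_raw = pd.DataFrame({"dum": [1, 2, 3, 4, 5]})
valid_raw = pd.DataFrame({"dum": [1, 3, 5]})

# Encode each set separately (no pd.concat needed).
train = pd.get_dummies(train_raw, columns=["dum"], dtype=int)
valid = pd.get_dummies(valid_raw, columns=["dum"], dtype=int)

# reindex adds the columns that valid is missing (filled with 0),
# drops any columns train never saw, and applies train's column order.
valid = valid.reindex(columns=train.columns, fill_value=0)

print(list(valid.columns))
```

Note that `reindex` also silently drops dummy columns for categories that appear only in the validation data, which is usually what a model fitted on the training columns requires.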
paradocslover