pandas get_dummies: how to remember which value becomes which new category?

Question

It seems quick and easy to one-hot encode multiple categorical variables at once using the get_dummies method, but how do I remember which value maps to which column, so that my test data has the same features as my training data? For example:

My training dataset has a CATEGORICAL feature:

   X
   cat
   dog
   lion
   lion

After get_dummies, I get something like this:

   X_1   X_2   X_3
    1     0     0
    0     1     0
    0     0     1
    0     0     1

After training the model, I am ready to test my awesome magic model, and here is the test data:

   X
   cat
   cat
   lion

If I apply the pd.get_dummies method, I will get something like this:

   X_1   X_2
    1     0
    1     0
    0     1

This is inconsistent with my training data features, so I simply can't apply my model to the test data.

Any suggestions so that I can get something like the following?

   X_1   X_2   X_3
    1     0     0
    1     0     0
    0     0     1

How can I get fit and transform functionality? Again, I have over 50 categorical features, and I can't apply LabelEncoder and then OneHotEncoder to them one by one.

Any suggestions? Thank you.

Short version: define categories upfront and cast the dtype as category. Now when you call get_dummies pandas will generate columns for all categories even if they don't exist in that particular dataset. — ayhan, Sep 06 '17 at 10:31
@ayhan, the answer in the post you mentioned is quite convenient if there are a handful of features, but what if there are over 50 categorical features? Any alternatives? — user6396, Sep 06 '17 at 11:31
You can use [LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) — Vivek Kumar, Sep 06 '17 at 12:15
Also, one more thing you need to consider is the opposite scenario to the one you described, in which the test data has more categorical variables than the training data. — Vivek Kumar, Sep 06 '17 at 12:21
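
The approach from the first comment (fix the categories upfront, then cast to a categorical dtype) can be applied to many columns with a loop. The following is only a minimal sketch under some assumptions: the `train` and `test` frames mirror the example in the question, and `cat_cols`, `categories`, and `encode` are hypothetical names introduced here for illustration. Because the columns are named after the values, it is also clear which value became which column.

```python
import pandas as pd

# Toy data mirroring the question's example.
train = pd.DataFrame({"X": ["cat", "dog", "lion", "lion"]})
test = pd.DataFrame({"X": ["cat", "cat", "lion"]})

# Hypothetical list of categorical columns; with 50+ features it could be
# built from something like train.select_dtypes("object").columns instead.
cat_cols = ["X"]

# "Fit" step: remember the categories seen in the training data.
categories = {col: sorted(train[col].dropna().unique()) for col in cat_cols}


def encode(df):
    """'Transform' step: cast to the stored categories, then one-hot encode."""
    df = df.copy()
    for col, cats in categories.items():
        # Values not in `cats` become NaN and get an all-zero row.
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=list(categories))


train_enc = encode(train)
test_enc = encode(test)
print(list(test_enc.columns))
```

With this sketch, a value that appears only in the test data is not given a new column; it is encoded as all zeros, which is one possible way to handle the opposite scenario raised in the last comment.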

score 0 · Answer 1 · answered Sep 06 '17 at 10:29

I apply get_dummies to all of the data, and after that I split it into training and test sets.
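
A minimal sketch of what this could look like, assuming the training and test frames are both available at encoding time (the `train` and `test` frames below mirror the question's example, and the `keys` labels are just illustrative names):

```python
import pandas as pd

# Toy data mirroring the question's example.
train = pd.DataFrame({"X": ["cat", "dog", "lion", "lion"]})
test = pd.DataFrame({"X": ["cat", "cat", "lion"]})

# Stack both frames, tagging each row with its origin so they can be split again.
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once, so both parts share exactly the same dummy columns.
encoded = pd.get_dummies(combined, columns=["X"])

# Split back into the two parts using the outer index level.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(test_enc.columns))
```

This only works when the test data already exists when the encoding is done, which is the limitation the comments on this answer point out.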

王晓晨


Sometimes this is not a feasible option... then what? – cs95 Sep 06 '17 at 10:33

Nope, this won't work in the real world. – user6396 Sep 06 '17 at 11:23
