It seems quick and easy to one-hot encode multiple categorical variables at once using the get_dummies method, but how do I keep track of which column is which so that my test data has the same features as my training data? For example:

My training dataset has a CATEGORICAL feature:

   X
   cat
   dog
   lion
   lion

After get_dummies, I get something like this:

   X_1   X_2   X_3
    1     0     0
    0     1     0
    0     0     1
    0     0     1

After training the model, I am ready to test my awesome magic model, and here is the test data:

   X
   cat
   cat
   lion

If I apply pd.get_dummies to it, I get something like this:

   X_1      X_2
   1       0
   1       0
   0       1

This is inconsistent with my training features, so I simply can't apply my model to the test data.

Any suggestions so that I can get something like the following?

   X_1   X_2   X_3
    1     0     0
    1     0     0
    0     0     1

How can I get fit-and-transform functionality? Again, I have over 50 categorical features, and I can't apply LabelEncoder and then OneHotEncoder to them one by one.

Any suggestions? Thank you.

user6396
  • Short version: define the categories upfront and cast the column's dtype to category. Then, when you call get_dummies, pandas will generate columns for all categories, even those that don't appear in that particular dataset. – ayhan Sep 06 '17 at 10:31
  • Oh wow... there's a dupe... and the solution is wonderful – cs95 Sep 06 '17 at 10:32
  • @ayhan, the answer in the post you mentioned is quite convenient if there are a handful of features, but what if there are over 50 categorical features? Any alternatives? – user6396 Sep 06 '17 at 11:31
  • You can use [LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) – Vivek Kumar Sep 06 '17 at 12:15
  • Also, one more thing you need to consider is the opposite of the scenario you described, in which the test data has more categorical variables than the training data. – Vivek Kumar Sep 06 '17 at 12:21
  • very good point. – user6396 Sep 07 '17 at 02:07
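
The suggestion in the first comment (define the categories upfront and cast to the category dtype) can be sketched for many columns at once. This is a minimal sketch, not the commenter's own code: the `cat_cols` list, the `categories` dict, and the `encode` helper are hypothetical names introduced here, and the categories are assumed to be learned from the training data only.

```python
import pandas as pd

# Toy data taken from the question.
train = pd.DataFrame({"X": ["cat", "dog", "lion", "lion"]})
test = pd.DataFrame({"X": ["cat", "cat", "lion"]})

# Assumption: the names of all categorical columns (could be 50+ in practice).
cat_cols = ["X"]

# "Fit": remember the categories seen in the training data, per column.
categories = {col: sorted(train[col].dropna().unique()) for col in cat_cols}


def encode(df):
    """'Transform': one-hot encode df using the categories learned from train."""
    df = df.copy()
    for col, cats in categories.items():
        # Values not in `cats` become NaN and end up as an all-zero row.
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=cat_cols)


train_encoded = encode(train)
test_encoded = encode(test)
print(test_encoded.astype(int))
```

Because the categories are fixed in advance, the test frame gets an all-zero `X_dog` column even though "dog" never appears in it. A category that appears only in the test data (the opposite scenario raised above) is encoded as all zeros rather than producing a new column.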

1 Answer

I apply get_dummies to all of the data first, and then split it into training and test sets.
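
A minimal sketch of that idea, assuming the training and test frames are both available before encoding (the `keys` labels "train" and "test" are just illustrative names used to split the frame again afterwards):

```python
import pandas as pd

# Toy data taken from the question.
train = pd.DataFrame({"X": ["cat", "dog", "lion", "lion"]})
test = pd.DataFrame({"X": ["cat", "cat", "lion"]})

# Stack both frames, labelling each block so it can be recovered later.
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once, so both parts share exactly the same dummy columns.
encoded = pd.get_dummies(combined, columns=["X"])

# Split back into the original two parts.
train_encoded = encoded.loc["train"]
test_encoded = encoded.loc["test"]
print(test_encoded.astype(int))
```

This only works when the test data already exists at training time. For data that arrives later, the set of columns would still need to be stored and reapplied.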

王晓晨