
I'm trying to encode categorical values as dummy vectors. pandas.get_dummies does a perfect job, but the dummy vectors depend on the values present in the DataFrame. How can I encode a second DataFrame with the same dummy vectors as the first DataFrame?

import pandas as pd


df=pd.DataFrame({'cat1':['A','N','K','P'],'cat2':['C','S','T','B']})
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(b)



   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0
2       0       1       0       0
3       0       0       0       1



df_test=pd.DataFrame({'cat1':['A','N'],'cat2':['T','B']})
c=pd.get_dummies(df_test['cat1'],prefix='cat1').astype('int')
print(c)

   cat1_A  cat1_N
0       1       0
1       0       1

How can I get this output instead?

   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0

I was thinking of manually computing the unique values for each column and then creating a dictionary to map the second DataFrame, but I'm sure there is already a function for that. Thanks!

user375348

2 Answers


I always use the category_encoders package because it offers a wide choice of encoders. It also works well with pandas, is pip-installable, and follows the sklearn API. This means you can quickly test different types of encoders with the fit and transform methods or in a Pipeline.

If you wish to encode just the first column, as in your example, you can do so as follows.

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1'])
# cols=None, all string columns encoded

df_trans = enc_ohe.fit_transform(df)
print(df_trans)

   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       1       0       0    C
1       0       0       0       1    S
2       1       0       0       0    T
3       0       0       1       0    B

By default, the column names use a numerical encoding instead of the original letters. This is helpful when your categories are long strings. You can change it by passing the use_cat_names=True kwarg, as mentioned by Arthur.

Now we can use the transform method to encode your second DataFrame.

df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)

print(df_test_trans)

   cat1_1  cat1_3 cat2
0       1       0    T
1       0       1    B

As noted in the comment in the first code block, leaving cols unset (cols=None) encodes all string columns.
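
For comparison, the same fit-then-transform idea can be sketched with pandas alone. This is only a minimal sketch, not part of category_encoders: the names `fitted_columns`, `train_dummies` and `test_dummies` are made up for illustration, and it assumes that categories unseen in the first DataFrame should simply be dropped.

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

# "Fit": remember the dummy columns produced by the first DataFrame.
train_dummies = pd.get_dummies(df['cat1'], prefix='cat1').astype('int')
fitted_columns = train_dummies.columns

# "Transform": encode the second DataFrame, then align it to the remembered
# columns. Missing columns are added and filled with 0; columns for
# categories that were not seen during the "fit" step are dropped.
test_dummies = (
    pd.get_dummies(df_test['cat1'], prefix='cat1')
    .reindex(columns=fitted_columns, fill_value=0)
    .astype('int')
)
print(test_dummies)
```

Whether silently dropping unseen categories is acceptable depends on your use case; raising an error instead may be safer for some pipelines.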

Little Bobby Tables
  • This doesn't seem to work if there are values for a category in the first data frame that aren't in the second. – michaelsnowden Mar 22 '17 at 23:48
  • @michaelsnowden By first and second dataframes do you mean train and test dataframes? – Little Bobby Tables Mar 23 '17 at 14:47
  • Yes, in this case. More generally, by "first one", I mean the one you call fit_transform on, and by "second one" I mean the one you call transform on. – michaelsnowden Mar 23 '17 at 17:43
  • That has certainly worked for me in this case before. The opposite has not worked for me: when there are values in the test (second) set that were not in the first set. The solution is a `pd.concat([train, test])` then `fit_transform`. – Little Bobby Tables Mar 23 '17 at 23:31
  • @josh The current version has a `use_cat_names` option that appends the values of the features to the column names (http://contrib.scikit-learn.org/categorical-encoding/onehot.html). – Arthur Azevedo De Amorim Aug 01 '18 at 14:20

I had the same problem before. This is what I did. It is not necessarily the best way to do it, but it works for me.

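import pandas as pd

df=pd.DataFrame({'cat1':['A','N'],'cat2':['C','S']})

# Declare every category up front, including ones absent from this DataFrame.
# Newer pandas versions no longer accept astype('category', categories=...),
# so the categories are passed through a CategoricalDtype instead.
df['cat1'] = df['cat1'].astype(pd.CategoricalDtype(categories=['A','N','K','P']))
# then run get_dummies
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')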

The key is calling astype with the category values passed in explicitly through categories.

To apply the same categories to all DataFrames, it is better to store the category values in variables, like

cat1_categories = ['A','N','K','P']
cat2_categories = ['C','S','T','B']

Then use astype like

df_test=pd.DataFrame({'cat1':['A','N'],'cat2':['T','B']})
df_test['cat1'] = df_test['cat1'].astype(pd.CategoricalDtype(categories=cat1_categories))
c=pd.get_dummies(df_test['cat1'],prefix='cat1').astype('int')
print(c)

   cat1_A  cat1_N  cat1_K  cat1_P
0       1       0       0       0
1       0       1       0       0
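
If you would rather not hardcode the lists, the categories can also be collected from the first DataFrame. The following is a minimal sketch under that assumption; the names `dtypes` and `encode` are hypothetical helpers introduced here, and values that were not seen in the first DataFrame become NaN and therefore encode as an all-zero row.

```python
import pandas as pd

df_train = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

# One CategoricalDtype per column, built from the values seen in df_train.
dtypes = {
    col: pd.CategoricalDtype(categories=sorted(df_train[col].unique()))
    for col in ['cat1', 'cat2']
}

def encode(frame):
    """Cast the columns to the stored categorical dtypes, then one-hot encode."""
    typed = frame.astype(dtypes)
    # get_dummies emits one column per declared category, used or not.
    return pd.get_dummies(typed, columns=list(dtypes)).astype('int')

train_encoded = encode(df_train)
test_encoded = encode(df_test)
print(test_encoded)
```

Both encoded frames end up with the same columns in the same order, because the column set comes from the stored dtypes rather than from the values present in each frame.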
Peng
  • Thanks but how do you encode the second DF the same way? – user375348 Oct 11 '16 at 15:17
  • You do the same: df['cat1'] = df['cat1'].astype('category', categories=['A','N','K','P']). The critical part is keeping the 'categories' parameter the same. You need to have a list of category values beforehand. It can come from the first DF or from somewhere else. – Peng Oct 11 '16 at 15:28
  • I edited my answer. I hope this clears up the confusion. Please let me know if you have more questions. – Peng Oct 11 '16 at 15:37
  • Clear. Also hardcoding the categories helps you handle the case where new categories show up in the test data, that were not present in the training data – user375348 Oct 11 '16 at 18:46