4

I have 3 sets of data (training, validation and testing) and when I run:

    training_x = pd.get_dummies(training_x, columns=['a', 'b', 'c'])

It gives me a certain number of features. But then when I run it across validation data, it gives me a different number and the same for testing. Is there any way to normalize (wrong word, I know) across all data sets so the number of features aligns?

Shamoon

5 Answers

7

As already stated, normally you should do the one-hot encoding before splitting. But there is another problem: one day you will surely want to apply your trained ML model to data in the wild, i.e. data you have not seen before, and then you need to perform exactly the same dummy transformation as when you trained the model. You may then have to deal with two cases:

  1. the new data contains categories that were not in your training data, and
  2. the other way round: a category no longer appears in your dataset, but your model was trained with it.

In case 1 you should just ignore the value, since your model most likely can't deal with it, not having been trained on it. In case 2 you should still generate these empty categories, so the data you want to predict on has the same structure as your training set. Note that the pandas method would not generate dummies for these categories, so it cannot guarantee that your prediction data gets the same structure as your training data, and therefore your model will most likely not be applicable to it.

You can address this by using the sklearn equivalent of get_dummies (with just a little more work), which looks like this:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # create some example data
    df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})

    # create a one hot encoder to generate the dummies and fit it to the data
    # (on scikit-learn < 1.2, pass sparse=False instead of sparse_output=False)
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    ohe.fit(df[['x']])

    # now let's simulate the two cases described above:
    # case 2: category 2 disappears from the data ...
    df.loc[1, 'x'] = 1
    # ... and case 1: an unseen category 5 shows up
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
    df = pd.concat([df, pd.DataFrame([{'x': 5, 'y': 5}])], ignore_index=True)

    # the actual feature generation is done in a separate step
    tr = ohe.transform(df[['x']])

    # if you need the columns in your existing data frame, you can glue them together
    df2 = pd.DataFrame(tr, columns=ohe.get_feature_names_out(), index=df.index)
    result = pd.concat([df, df2], axis='columns')

With sklearn's OneHotEncoder you can separate the identification of the categories from the actual one-hot encoding (the creation of the dummies). You can also save the fitted encoder, so that you can apply it later when your model is used in production. Note the handle_unknown option, which tells the encoder that if it encounters an unknown category later, it should just ignore it instead of raising an error.
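To illustrate saving the fitted encoder, here is a minimal sketch using joblib (the file name ohe.joblib and the frame new_df are stand-ins, not part of the answer above):

    import joblib

    # persist the fitted encoder alongside the model artifacts
    joblib.dump(ohe, 'ohe.joblib')

    # ... later, when applying the model to unseen data ...
    ohe = joblib.load('ohe.joblib')
    new_features = ohe.transform(new_df[['x']])  # new_df is hypothetical new data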

jottbe
6

One simple solution is to align your validation and test sets to the training dataset after applying the dummies function. Here is how:

    # pandas get_dummies creates a different set of features for each dataset
    train = pd.get_dummies(train)
    valid = pd.get_dummies(valid)
    test = pd.get_dummies(test)

    # align the validation and test features to the train dataset's columns
    train, valid = train.align(valid, join='left', axis=1)
    train, test = train.align(test, join='left', axis=1)
Mahrokh
  • When I attempt this, my prediction function fails. Since the left join creates NaN values in the test set, the predict function throws an error. Any ideas for a workaround? – Slyme Mar 15 '20 at 01:16
  • Does the test set have the same columns in the same order? If you copy the head of the train, valid, and test sets here, we can take a look and investigate the problem. – Mahrokh Mar 15 '20 at 18:51
  • @Slyme when you do a 'left' join, NaNs are added for columns that are missing from the test data. So better to do an 'inner' join: train, valid = train.align(valid, join='inner', axis=1); train, test = train.align(test, join='inner', axis=1) – shantanu pathak Jun 06 '21 at 17:18
  • @Mahrokh, as Slyme said above, the problem is with the 'left' join. I think you could edit your answer to include the 'inner' join option as well for better coverage! – shantanu pathak Jun 06 '21 at 17:20
  • Pretty solution! You should just add the additional step of replacing the NaN in the validation and test sets by 0: valid.fillna(0, inplace=True); test.fillna(0, inplace=True) – Lucie G Sep 12 '22 at 07:13
3

Referenced from Kaggle: Link

Don't forget to add fill_value=0 to avoid NaN in test...
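Presumably this refers to the fill_value parameter of DataFrame.align; a small sketch building on the align calls from the previous answer:

    # columns that exist only in train become all-zero columns in valid/test
    # instead of NaN columns
    train, valid = train.align(valid, join='left', axis=1, fill_value=0)
    train, test = train.align(test, join='left', axis=1, fill_value=0)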

jacko
2

You can convert the columns that need to be turned into dummy variables to the category datatype:

    # example frame; col_1 holds the full set of categories A, D, G, J
    df = pd.DataFrame({'col_1': list('ADGJ'),
                       'col_2': list('BEHK'),
                       'col3': list('CFIL')})
    df.index.name = 'index'

    df.col_1 = df.col_1.astype('category')
    df1 = df.iloc[:1, :].copy()
    df2 = df.drop(df1.index)

    pd.get_dummies(df1, columns=['col_1'])
    Out[701]:
          col_2 col3  col_1_A  col_1_D  col_1_G  col_1_J
    index
    0         B    C        1        0        0        0

    # it will show zero even for categories missing from the sub-set
    pd.get_dummies(df2, columns=['col_1'])
    Out[702]:
          col_2 col3  col_1_A  col_1_D  col_1_G  col_1_J
    index
    1         E    F        0        1        0        0
    2         H    I        0        0        1        0
    3         K    L        0        0        0        1
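The same trick extends to genuinely new data: fix the categories of the column to the training categories, and get_dummies will emit the full set of columns even for values the frame does not contain. A small sketch (the test frame here is hypothetical, not part of the answer above):

    # force the test column onto the training categories; values outside
    # that set become NaN and produce all-zero dummy rows
    test['col_1'] = pd.Categorical(test['col_1'],
                                   categories=df['col_1'].cat.categories)
    test_dummies = pd.get_dummies(test, columns=['col_1'])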
BENY
1

Dummies should be created before splitting the dataset into train, test, or validation sets.

Suppose I have train and test dataframes as follows:

    import pandas as pd

    train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
    test = pd.DataFrame([7, 8], columns=['A'])

    # creating dummies for train
    pd.get_dummies(train, columns=['A'])

Output:

       A_1  A_2  A_3  A_4  A_5  A_6
    0    1    0    0    0    0    0
    1    0    1    0    0    0    0
    2    0    0    1    0    0    0
    3    0    0    0    1    0    0
    4    0    0    0    0    1    0
    5    0    0    0    0    0    1

    # creating dummies for test data
    pd.get_dummies(test, columns=['A'])

Output:

       A_7  A_8
    0    1    0
    1    0    1

So dummies for categories 7 and 8 will only be present in test, which results in a different set of features. To avoid this, concatenate the datasets, create the dummies, and then split again:

    # combine both datasets so get_dummies sees every category
    final_df = pd.concat([train, test])

    dummy_created = pd.get_dummies(final_df)

    # now you can split it into train and test
    from sklearn.model_selection import train_test_split
    train_x, test_x = train_test_split(dummy_created, test_size=0.33)

Now train and test will have the same set of features.
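Note that train_test_split reshuffles the rows; if you instead want to keep the original train/test membership, a small sketch slicing by the original lengths:

    # recover the original split instead of a random re-split
    train_x = dummy_created.iloc[:len(train)]
    test_x = dummy_created.iloc[len(train):]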

qaiser
  • I have `Y` values for my train and validation sets, but not for my test set. How would I handle that? – Shamoon Jun 24 '19 at 14:47
  • This is absolutely a data leakage scenario. You can never, ever touch test data. In addition, how are you going to merge the data when you already have a model built? – Ehsan May 12 '20 at 06:50
  • I agree with Ehsan, this approach is not optimal. – Grzegorz Rut May 25 '21 at 19:29