How to deal with imputation and one-hot encoding in pandas?

Question

I am trying to apply both imputation and one-hot encoding to my data set. I know that imputation can change the dimensions of the data, so I took care of that manually. The model was working fine, but after I added one-hot encoding the program no longer runs. I am getting a dimension mismatch error.

test_X = pd.get_dummies(test)
train_X = pd.get_dummies(train)

col_with_missingVal = (col for col in train_X.columns if train_X[col].isnull().any())
for col in col_with_missingVal:
    train_X[col + 'is_missing'] = train_X[col].isnull()
    test_X[col + 'is_missing'] = test_X[col].isnull()

#impute the data
imputer = Imputer()
imp_train_X = pd.DataFrame(imputer.fit_transform(train_X))
imp_test_X = pd.DataFrame(imputer.fit_transform(test_X))
imp_train_X.columns = train_X.columns
imp_test_X.columns = test_X.columns

#Fit the model
my_model = RandomForestRegressor()
my_model.fit(imp_train_X, train_y)

# Use the model to make predictions
predicted_prices = my_model.predict(imp_test_X)

I am getting the following error on the last line of code:

ValueError: Number of features of the model must match the input. Model n_features is 293 and input n_features is 274

What is the reason for this error and how can this be fixed?

score 2 · Accepted Answer · answered May 21 '18 at 06:48

The problem is in the first two lines. pd.get_dummies() returns different columns for train and test if the categorical values they contain differ.

For example, if a column in train has 3 categories, 3 columns will be created for it. The test data, however, may contain only 2 of those categories in that column, in which case pd.get_dummies() produces only 2 columns. That leads to a different number of columns in train and test.
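The mismatch described above can be reproduced on a toy data set. This is only a minimal sketch: the `color` and `size` columns and their values are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical toy data: "blue" appears in train but not in test.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["red", "green"], "size": [4, 5]})

# Encoding each frame separately only creates columns for the
# categories that are actually present in that frame.
train_X = pd.get_dummies(train)
test_X = pd.get_dummies(test)

print(train_X.shape[1], test_X.shape[1])

# Columns the model was trained on but that are absent from test.
missing_in_test = sorted(set(train_X.columns) - set(test_X.columns))
print(missing_in_test)
```

A model fitted on `train_X` would then reject `test_X` with the same kind of "number of features" error as in the question.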

There are a couple of things you can do here:

1) Easiest: use pd.get_dummies() on the whole data set before the train/test split, and then split the data. This is not recommended, though, because it leaks information from the test data to the model.

2) If you can use the development version of scikit-learn, use CategoricalEncoder to perform the one-hot encoding.

3) Use a combination of LabelEncoder + OneHotEncoder in the current scikit-learn version to achieve the same result. See my other answer for an example.
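As a rough illustration of option 2: the CategoricalEncoder functionality ended up in OneHotEncoder from scikit-learn 0.20 onwards, which accepts string columns directly. The sketch below assumes such a version is installed; the `color` column and its values are hypothetical. The encoder is fitted on the training data only and then reused on the test data, so both outputs have the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "purple" never occurs in training.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training data only.
train_encoded = encoder.fit_transform(train[["color"]]).toarray()

# Reuse the learned categories on the test data: transform(), not fit().
test_encoded = encoder.transform(test[["color"]]).toarray()

print(encoder.categories_)
print(train_encoded.shape, test_encoded.shape)
```

With `handle_unknown="ignore"`, the row for the unseen category contains only zeros, so the column count always matches what the model was trained on.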

Note

Also, only call transform() on the test data, never fit() or fit_transform(). Do this:

# If you call fit_transform() here, the imputer will re-learn the
# mean from the test data, which leads to inconsistent imputation
# and data leakage.
imp_test_X = pd.DataFrame(imputer.transform(test_X))
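A fuller sketch of the fit-on-train, transform-on-test pattern follows. It assumes a recent scikit-learn, where the `Imputer` class used in the question has been replaced by `sklearn.impute.SimpleImputer`; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in both sets.
train_X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
test_X = pd.DataFrame({"a": [np.nan, 100.0], "b": [np.nan, 40.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform() learns the column means from the training data.
imp_train_X = pd.DataFrame(imputer.fit_transform(train_X), columns=train_X.columns)

# transform() fills the test data with the means learned from training.
imp_test_X = pd.DataFrame(imputer.transform(test_X), columns=test_X.columns)

print(imp_test_X)
```

Here the missing test values are filled with the training means (2.0 for `a`, 20.0 for `b`), not with statistics computed from the test set.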

Vivek Kumar

That's a very good answer. The only nuance it does not cover explicitly is the inverse situation, when the testing dataset has more categories than the training set. In this case the error raised would be the same, I believe. Solutions 1) and 2) will still work, with the data leakage caveat that you mentioned for 1). However, solution 3) breaks, as `LabelEncoder` cannot handle new categories. – Mischa Lisovyi May 21 '18 at 09:12
@MykhailoLisovyi That's because it should be so. Think about the real-world scenario where you encounter something new (a category not present in training). You need to decide what to do. Either you make a new category in the training data (named UNKNOWN) to accommodate the new labels, but then that column will be empty for all training samples, so the model cannot give it any importance. Or, at prediction time, you leave out the samples that contain such categories, and later retrain the model once such data is available. Hope this makes it clear. – Vivek Kumar May 21 '18 at 09:15
I believe there is a third alternative, which is implemented in `CategoricalEncoder`: in the final OHE output, put zeros into all training-known category columns for a training-unknown category. This might be possible, but it is not straightforward to implement with a `LabelEncoder+OneHotEncoder` sequence. – Mischa Lisovyi May 21 '18 at 09:23
@MykhailoLisovyi Are you talking about handle_unknown in CategoricalEncoder? Something like that is implemented in [my answer here for LabelEncoder](https://stackoverflow.com/a/50056670/3374996) – Vivek Kumar May 21 '18 at 09:29
Yes, indeed, I was talking about the `handle_unknown` argument of `CategoricalEncoder`. And yes, your solution in that other answer goes in that direction, I will try it out in the future :) I believe it would be useful to add a note on handling such a scenario to your option 3), with the link to your answer that you included in the comments – Mischa Lisovyi May 21 '18 at 09:36
@MykhailoLisovyi I purposefully did not add that to this answer, seeing that the OP is a beginner. Once the OP gets familiar with the technique and the caveats, the question that you asked in the first comment should also occur to him. And then he can decide what to do. – Vivek Kumar May 21 '18 at 09:40
@VivekKumar Yeah, you are absolutely right. I've been implementing all the algorithms from scratch using numpy as I am learning from Andrew Ng's deep learning course. I thought I should also start learning some beginner-friendly library. What do you suggest, should I stick to scikit-learn or move to PyTorch or TensorFlow? – Ayush Chaurasia May 23 '18 at 14:01

score 0 · Answer 2 · answered May 14 '19 at 09:22

I've been struggling with a similar problem and I've found an approach that might help in this situation.

The main idea is to convert the column to the categorical type while you are still working with the complete dataset, with something like this:

dataframe[column] = dataframe[column].astype('category')

When you do that, the dataframe's column stores all the available categories. Later, when you perform a train/test split of the data, the categories are preserved even if some values are not present in one of the resulting datasets.

The pandas get_dummies function uses the categories of the column to perform the encoding. Since the categories are stable, you will always get the same number of columns after encoding.
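A small sketch of this behaviour, using a made-up `color`/`price` frame: the column is converted to the categorical type before the split, so both halves keep the full category list and get_dummies produces identical columns for each.

```python
import pandas as pd

# Hypothetical complete dataset, before any train/test split.
data = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                     "price": [1, 2, 3, 4]})

# Convert to categorical while all values are still in one frame.
data["color"] = data["color"].astype("category")

# A simple positional split; neither half contains every color.
train = data.iloc[:2]   # red, green
test = data.iloc[2:]    # blue, red

# get_dummies creates one column per known category, observed or not.
train_X = pd.get_dummies(train)
test_X = pd.get_dummies(test)

print(list(train_X.columns))
print(list(test_X.columns))
```

Note that the category list here is derived from the whole dataset, so the earlier caveat about using test-set information before the split applies to this approach as well.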

I'm still exploring this solution myself. Keep in mind that you can manipulate the categories directly if you need to. set_categories returns a new Series, so assign the result back to the column. You can use something like this:

dataframe[column].cat.set_categories([.....])
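One possible use of set_categories, sketched under the assumption that you want the test column to carry exactly the categories seen in training. The frames and values below are hypothetical. Values that are not in the given category list become missing, which get_dummies encodes as an all-zero row.

```python
import pandas as pd

# Hypothetical toy data: "purple" is unknown to the training set.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})

train["color"] = train["color"].astype("category")

# Take the category list from the training column only.
known = train["color"].cat.categories

# set_categories returns a new Series, so the result is assigned back.
# "purple" is not in `known` and therefore becomes NaN.
test["color"] = test["color"].astype("category").cat.set_categories(known)

train_X = pd.get_dummies(train)
test_X = pd.get_dummies(test)

print(list(test_X.columns))
print(test_X)
```

Both frames end up with the same dummy columns, and the row that held the unknown category has no column set.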
