Keep same dummy variable in training and testing data

Question

I am building a prediction model in Python with separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], as well as string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].

To train the model, I first use 'pd.get_dummies' to create dummy variables for these columns, and then fit the model on the transformed training data.

I apply the same transformation to my test data and predict with the trained model. However, I get the error

ValueError: Number of features of the model must  match the input. Model n_features is 1487 and  input n_features is 1345

The reason is that the test data produces fewer dummy variables, because it contains fewer distinct 'city' and 'zipcode' values.

How can I solve this problem? For example, 'OneHotEncoder' will only encode numeric categorical variables, and 'DictVectorizer()' will only encode string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine.

Handling categorical features using scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python

score 81 · Answer 1 · answered Jul 28 '17 at 04:59

You can also just get the missing columns and add them to the test dataset:

# Columns present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set, filled with 0
for c in missing_cols:
    test[c] = 0
# Put the test columns in the same order as the training columns
test = test[train.columns]

This code also ensures that columns arising from categories that appear in the test dataset but not in the training dataset are removed.
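As a rough sketch, the steps above can be wrapped in a small helper that leaves the original test frame untouched. The function name `match_columns` and the toy city data are made up for illustration and are not part of pandas.

```python
import pandas as pd


def match_columns(train_dummies, test_dummies, fill_value=0):
    """Return a copy of test_dummies with exactly the columns of train_dummies."""
    out = test_dummies.copy()
    # Add every column that exists only in the training frame
    for col in set(train_dummies.columns) - set(out.columns):
        out[col] = fill_value
    # Selecting by train's columns drops test-only columns and fixes the order
    return out[train_dummies.columns]


# Toy data: 'Boston' appears only in test, 'New York' and 'Los Angeles' only in train
train = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]}))
test = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}))

test_matched = match_columns(train, test)
print(list(test_matched.columns))
```

Working on a copy avoids modifying the caller's test frame in place, which the loop in the answer does.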

Thibault Clement

Instead of the last line, you can also use `train, test = train.align(test, axis=1)` – Ammar Alyousfi Nov 07 '18 at 07:30

If the model is trained on get_dummies output and saved, and we later load it and score new test data (only one record), how do we get the right column names for that record based on its values? – hanzgs Jun 07 '19 at 01:06

Eduard Ilyasov · Answer 2 · 2016-12-27T06:50:01.657

27

Assume the train and test datasets have identical feature names. You can concatenate train and test into one dataset, get dummies from the concatenated dataset, and then split it back into train and test.

You can do it this way:

import pandas as pd
train = pd.DataFrame(data = [['a', 123, 'ab'], ['b', 234, 'bc']],
                     columns=['col1', 'col2', 'col3'])
test = pd.DataFrame(data = [['c', 345, 'ab'], ['b', 456, 'ab']],
                     columns=['col1', 'col2', 'col3'])
train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0)
dataset_preprocessed = pd.get_dummies(dataset)
train_preprocessed = dataset_preprocessed[:train_objs_num]
test_preprocessed = dataset_preprocessed[train_objs_num:]

As a result, the train and test datasets have the same number of features.

edited Dec 27 '16 at 06:50

answered Dec 27 '16 at 04:34


What about unseen test data ? Concatenate and retrain model ? Doesn't seem like a viable option – randomSampling Aug 31 '18 at 07:06

@randomSampling have you found a solution for this? If yes, could you please take a look at this [question](https://stackoverflow.com/questions/64910582/can-we-make-the-ml-model-pickle-file-more-robust-by-accepting-or-ignoring-n?noredirect=1#comment114761689_64910582) – R overflow Nov 20 '20 at 13:58

score 22 · Answer 3 · answered Nov 11 '17 at 16:50

train2,test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. The fill_value argument is the value used to fill columns that are missing from either frame.
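To make the effect of the join argument concrete, here is a small sketch on made-up city data. With join='outer' both frames get the union of the columns, so a category seen only in test becomes a new training column; join='left' keeps only the training columns, which is usually what an already fitted model expects.

```python
import pandas as pd

# Toy data: 'New York' appears only in train, 'Boston' only in test
train = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York"]}))
test = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}))

# Union of columns: both frames gain the column they were missing, filled with 0
train_outer, test_outer = train.align(test, join="outer", axis=1, fill_value=0)

# Only train's columns, in train's order: 'city_Boston' is dropped from test
train_left, test_left = train.align(test, join="left", axis=1, fill_value=0)

print(list(test_outer.columns))
print(list(test_left.columns))
```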

user1482030

In the train data, the column "Marital_Status" becomes "Marital_Status_Single", "Marital_Status_Married", and "Marital_Status_Divorced", but in the test data it is still "Marital_Status". If its value is "Single", how do I set the column "Marital_Status_Single" to 1 and the other two to 0? – hanzgs Jun 12 '19 at 02:22

@hanzgs, it's very late, but to help others: before aligning train and test, one-hot encode the test data as well with "pd.get_dummies(test)". – rmswrp May 15 '21 at 20:14

score 6 · Answer 4 · answered Jun 10 '20 at 12:32

I have used this in the past, after running get_dummies on both the train and test sets:

X_test = X_test.reindex(columns = X_train.columns, fill_value=0)

It may need a little tweaking for the individual case. Note that it throws away novel values that appear only in the test set, and columns missing from the test set are filled in, in this case with all zeros.
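The same reindex call also seems to cover the case where a saved model later scores a single new record: keep the list of training columns next to the model and reindex the encoded record against it. A minimal sketch, assuming made-up columns "Marital_Status" and "age"; how the column list is stored (pickle, JSON, ...) is left open.

```python
import pandas as pd

# Training time: encode and remember the resulting column names
X_train_raw = pd.DataFrame({
    "Marital_Status": ["Single", "Married", "Divorced"],
    "age": [25, 40, 51],
})
X_train = pd.get_dummies(X_train_raw)
train_columns = list(X_train.columns)  # store this together with the model

# Prediction time: a single raw record, encoded on its own
new_record = pd.DataFrame({"Marital_Status": ["Single"], "age": [33]})
X_new = pd.get_dummies(new_record).reindex(columns=train_columns, fill_value=0)

print(X_new)
```

Encoding the single record yields only "Marital_Status_Single"; the reindex adds the other two dummy columns as zeros and restores the training column order.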

score 4 · Answer 5 · answered Aug 03 '18 at 07:47

This is a rather old question, but if you want to stay within the scikit-learn API, you can use the following DummyEncoder class: https://gist.github.com/psinger/ef4592492dc8edf101130f0bf32f5ff9

It uses the category dtype to specify which dummies to create, as also explained here: Dummy creation in pipeline with different levels in train and test set
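The category-dtype idea can be sketched with plain pandas, independent of the linked gist (this is not the gist's code, and the toy city data is invented). Fixing the categories from the training data makes get_dummies emit one column per known level in both frames; values outside those levels become NaN and therefore encode as all zeros.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["Chicago", "Boston"]})

# Fix the set of levels from the training data only
city_dtype = pd.CategoricalDtype(categories=sorted(train["city"].unique()))

# get_dummies creates a column for every category, observed or not
train_d = pd.get_dummies(train.astype({"city": city_dtype}))
test_d = pd.get_dummies(test.astype({"city": city_dtype}))

print(list(test_d.columns))
```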

Mattravel · Answer 6 · 2023-03-07T06:39:22.993

For sklearn >= 0.20, OneHotEncoder can now encode string data.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X_train = pd.DataFrame({
    'zip' : [23151, 12355],
    'city' : ['New York', 'Los Angeles']
})

X_test = pd.DataFrame({
    'zip' : [91521, 23151],
    'city' : ['Chicago', 'New York']
})

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # New in version 1.2: sparse was renamed to sparse_output
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)

To get a clean DataFrame with the corresponding column names (similar to pd.get_dummies):

cols_ohe = ohe.get_feature_names_out()
X_train_ohe = pd.DataFrame(X_train_ohe, columns=cols_ohe)
X_test_ohe = pd.DataFrame(X_test_ohe, columns=cols_ohe)

>>> X_train_ohe 
zip_12355   zip_23151   city_Los Angeles    city_New York
0.0         1.0         0.0                 1.0
1.0         0.0         1.0                 0.0

>>> X_test_ohe 
zip_12355   zip_23151   city_Los Angeles    city_New York
0.0         0.0         0.0                 0.0
0.0         1.0         0.0                 1.0
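If the encoder is going to be saved with the model, one option is to put both into a single Pipeline so the fitted categories travel with the estimator. This is only a sketch: the labels in y_train and the choice of LogisticRegression are invented for illustration, and the encoder is left at its default sparse output, which the classifier accepts.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "zip": [23151, 12355, 23151, 12355],
    "city": ["New York", "Los Angeles", "New York", "Los Angeles"],
})
y_train = [1, 0, 1, 0]  # made-up labels, for illustration only

X_test = pd.DataFrame({
    "zip": [91521, 23151],
    "city": ["Chicago", "New York"],
})

model = Pipeline([
    # Unknown categories at predict time are encoded as all zeros
    ("encode", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["zip", "city"]),
    ])),
    ("clf", LogisticRegression()),
])

model.fit(X_train, y_train)
pred = model.predict(X_test)  # no feature-count mismatch despite the unseen values
print(pred)
```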

score 0 · Answer 7 · answered May 10 '23 at 10:57

Convert the zip code to str.

Use fit_transform() on the training data and transform() on the testing data in OneHotEncoder.

Gokul Patel

