One-hot encoding with a categorical dataset: how to deal with different (fewer) values in categorical data

Question

Training dataset total categorical columns: 27

Test dataset total categorical columns: 27

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[test_low_cardinality_cols]))

After encoding, while preparing the test data for prediction:

number of columns from test data: 115

number of columns from train data: 122

I checked the cardinality in the test data; it is lower for a few columns compared to the same columns in the training data.

Train_data.column#1: 2
Test_data:column#1: 1

Train_data.column#2: 5
Test_data:column#2: 3
and more..

So one-hot encoding the test data on its own automatically produces fewer columns. Is there a better way to prepare the test dataset without any data loss?

Did you apply the one-hot encoder separately to the training and testing data? — Rajith Thennakoon, Nov 28 '19 at 02:49
Yes. Training and test data are different, so I did the encoding for training, then evaluated, then applied it to the test data. Otherwise I need to match the values of each categorical column: if they match, I can include those columns for one-hot encoding; if not, I have to exclude those columns, which is a data loss. — Subbu VidyaSekar, Nov 28 '19 at 02:51
I once had the same issue. What I did was add the missing columns, filled with zeros, to the testing set to bring it back to the same shape. To minimize the effect, I tried to keep the same column order in the training set and the testing set. — Rajith Thennakoon, Nov 28 '19 at 02:54
While encoding, the column names became 1, 2, 3, 4..., so it's difficult to compare with the training columns. — Subbu VidyaSekar, Nov 28 '19 at 02:57
Add a prefix to the column name, like something.1, something.2. Somehow you need the same shape for the testing set, so try adding zero columns and check the performance. It is better to maintain the column order as well. — Rajith Thennakoon, Nov 28 '19 at 03:01
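
The workaround described in these comments (add the missing columns as zeros and keep the training column order) can be sketched with pandas. This is only an illustration under assumptions: it uses pandas.get_dummies instead of OneHotEncoder so that the encoded columns carry prefixed names, and the toy frames and the `color` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the test set only contains one of the three categories.
X_train = pd.DataFrame({"color": ["red", "green", "blue"]})
X_test = pd.DataFrame({"color": ["red", "red"]})

# Encode each set separately; prefix_sep="." gives names like "color.red".
train_encoded = pd.get_dummies(X_train, prefix_sep=".", dtype=int)
test_encoded = pd.get_dummies(X_test, prefix_sep=".", dtype=int)

# Align the test columns to the training columns:
# missing columns are added and filled with 0, and the training order is kept.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

Note that reindex also drops any test-only column that never appeared in training, which mirrors what handle_unknown='ignore' does in OneHotEncoder.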

score 1 · Accepted Answer · answered Dec 01 '19 at 05:37

The ideal procedure is to fit the OneHotEncoder on the training data and then only call transform on the test data. This way, you get a consistent number of columns in the train and test data.

Something like the following:

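import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Fit on the training data only, then reuse the fitted encoder on the test data.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train)

OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test))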

To map the output columns of OneHotEncoder back to the original features and categories, use its get_feature_names method (named get_feature_names_out in scikit-learn 1.0 and later). This answer might also help.
