I am trying to use the KDD Cup 99 data set, which has hundreds of thousands of samples and 41 features, for one of my machine learning projects. It is essentially packet captures of a particular network, collected using tcpdump.
I used scikit-learn's train_test_split function as below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
After the split, the arrays have the following shapes:
X_train : (444618, 41)
y_train : (444618,)
X_test : (49403, 41)
y_test : (49403,)
Out of the 41 features, 3 are of string type; people who have worked with this data set will recognize them: protocol_type, service, and flag.
I separated these three features from both the train and test samples, then did label encoding and one-hot encoding separately on each. The resulting arrays now have these shapes:
X_train_obj1: (444618, 3)
X_train_obj2: (444618, 65)
X_train_obj3: (444618, 11)
X_test_obj1: (49403, 3)
X_test_obj2: (49403, 64)
X_test_obj3: (49403, 11)
This is where I have the issue. For some reason, X_train_obj2 has 65 features/columns whereas X_test_obj2 has 64. When I later merge these back into the respective train and test sets, the train set ends up with 117 columns and the test set with 116, so fit/predict with standard algorithms like KNeighborsClassifier, SVM, etc. fails with an error indicating inconsistent sizes.
Corresponding code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode each of the three string columns, train and test separately.
label_encoder = LabelEncoder()
train_proto_label_encoded = label_encoder.fit_transform(X_train_obj['protocol_type'])
train_srv_label_encoded = label_encoder.fit_transform(X_train_obj['service'])
train_flag_label_encoded = label_encoder.fit_transform(X_train_obj['flag'])
test_proto_label_encoded = label_encoder.fit_transform(X_test_obj['protocol_type'])
test_srv_label_encoded = label_encoder.fit_transform(X_test_obj['service'])
test_flag_label_encoded = label_encoder.fit_transform(X_test_obj['flag'])

# One-hot encode the label-encoded columns, again train and test separately.
hot_encoder = OneHotEncoder()
train_proto_1hot_encoded = hot_encoder.fit_transform(train_proto_label_encoded.reshape(-1, 1))
train_srv_1hot_encoded = hot_encoder.fit_transform(train_srv_label_encoded.reshape(-1, 1))
train_flag_1hot_encoded = hot_encoder.fit_transform(train_flag_label_encoded.reshape(-1, 1))
test_proto_1hot_encoded = hot_encoder.fit_transform(test_proto_label_encoded.reshape(-1, 1))
test_srv_1hot_encoded = hot_encoder.fit_transform(test_srv_label_encoded.reshape(-1, 1))
test_flag_1hot_encoded = hot_encoder.fit_transform(test_flag_label_encoded.reshape(-1, 1))
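When I merge everything back, it is along these lines (a sketch; X_train_num and X_test_num are illustrative names for the remaining 38 numeric columns):

import numpy as np

X_train_full = np.hstack([X_train_num,
                          train_proto_1hot_encoded.toarray(),
                          train_srv_1hot_encoded.toarray(),
                          train_flag_1hot_encoded.toarray()])  # (444618, 117)
X_test_full = np.hstack([X_test_num,
                         test_proto_1hot_encoded.toarray(),
                         test_srv_1hot_encoded.toarray(),
                         test_flag_1hot_encoded.toarray()])    # (49403, 116)

The 117 vs. 116 column counts are exactly where the later fit/predict calls break.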
I did some debugging with print statements, and essentially the train set gets samples covering all 65 different service types, whereas the test set gets samples for only 64 of them.
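A minimal version of that check looks like this (a sketch; it assumes X_train_obj and X_test_obj are the DataFrames holding the three string columns):

import numpy as np

# Compare how many distinct values each string column has in train vs. test.
for col in ['protocol_type', 'service', 'flag']:
    print(col,
          '-> train:', len(np.unique(X_train_obj[col])),
          'test:', len(np.unique(X_test_obj[col])))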
Can you help me understand and fix this?
1) Is this behavior expected when doing label encoding and one-hot encoding with the scikit-learn APIs?
2) How can I fix this and make sure the train and test sets end up with the same set of service types (or string categories in general)?
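One idea I had is to fit each encoder on the training data only and then just call transform on the test data, so both sides share the same category-to-column mapping. A minimal sketch of that, assuming a scikit-learn version (0.20+) whose OneHotEncoder accepts string columns directly, and using handle_unknown='ignore' so a category missing from the training data does not crash the transform:

from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(handle_unknown='ignore')

# Fit on TRAIN only: the training categories fix the column layout.
train_obj_1hot = hot_encoder.fit_transform(
    X_train_obj[['protocol_type', 'service', 'flag']])

# Transform (not fit_transform) TEST with the same fitted encoder, so it
# produces exactly the same columns; unseen categories become all-zero rows.
test_obj_1hot = hot_encoder.transform(
    X_test_obj[['protocol_type', 'service', 'flag']])

print(train_obj_1hot.shape, test_obj_1hot.shape)  # same column count

Is this the recommended way, or is there a better approach?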
I can add the full code to the question if required.