I am trying to use the KDD Cup 99 data set, which has hundreds of thousands of samples and 41 features, for one of my machine learning projects. It is essentially packet captures of a particular network, collected using tcpdump.
I used scikit-learn's train_test_split function as below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
After the split, the arrays have the following shapes:
X_train : (444618, 41)
y_train : (444618,)
X_test : (49403, 41)
y_test : (49403,)
Out of the 41 features, 3 are of string type; people who have worked with this data set will recognize them: protocol_type, service, and flag.
I separated these three features from both the train and test samples, then did label encoding and one-hot encoding separately on each. The resulting arrays now have these shapes:
X_train_obj1: (444618, 3)
X_train_obj2: (444618, 65)
X_train_obj3: (444618, 11)
X_test_obj1: (49403, 3)
X_test_obj2: (49403, 64)
X_test_obj3: (49403, 11)
This is where I have the issue. For some reason, X_train_obj2 has 65 features/columns whereas X_test_obj2 has 64. When I later merge these back into the respective train and test sets, the train set ends up with 117 columns and the test set with 116, so fit/predict with standard algorithms like KNeighborsClassifier, SVM, etc. fails with an error indicating inconsistent sizes.
Corresponding code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode each of the three string columns, train and test separately.
label_encoder = LabelEncoder()
train_proto_label_encoded = label_encoder.fit_transform(X_train_obj['protocol_type'])
train_srv_label_encoded = label_encoder.fit_transform(X_train_obj['service'])
train_flag_label_encoded = label_encoder.fit_transform(X_train_obj['flag'])
test_proto_label_encoded = label_encoder.fit_transform(X_test_obj['protocol_type'])
test_srv_label_encoded = label_encoder.fit_transform(X_test_obj['service'])
test_flag_label_encoded = label_encoder.fit_transform(X_test_obj['flag'])

# One-hot encode the label-encoded columns, again train and test separately.
hot_encoder = OneHotEncoder()
train_proto_1hot_encoded = hot_encoder.fit_transform(train_proto_label_encoded.reshape(-1, 1))
train_srv_1hot_encoded = hot_encoder.fit_transform(train_srv_label_encoded.reshape(-1, 1))
train_flag_1hot_encoded = hot_encoder.fit_transform(train_flag_label_encoded.reshape(-1, 1))
test_proto_1hot_encoded = hot_encoder.fit_transform(test_proto_label_encoded.reshape(-1, 1))
test_srv_1hot_encoded = hot_encoder.fit_transform(test_srv_label_encoded.reshape(-1, 1))
test_flag_1hot_encoded = hot_encoder.fit_transform(test_flag_label_encoded.reshape(-1, 1))
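When I merge everything back, it is along these lines (a sketch; X_train_num and X_test_num are illustrative names for the remaining 38 numeric columns):

import numpy as np

X_train_full = np.hstack([X_train_num,
                          train_proto_1hot_encoded.toarray(),
                          train_srv_1hot_encoded.toarray(),
                          train_flag_1hot_encoded.toarray()])  # (444618, 117)
X_test_full = np.hstack([X_test_num,
                         test_proto_1hot_encoded.toarray(),
                         test_srv_1hot_encoded.toarray(),
                         test_flag_1hot_encoded.toarray()])    # (49403, 116)

The 117 vs. 116 column counts are exactly where the later fit/predict calls break.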
I did some debugging with print statements, and essentially the train set gets samples covering all 65 different service types, whereas the test set gets samples for only 64 of them.
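A minimal version of that check looks like this (a sketch; it assumes X_train_obj and X_test_obj are the DataFrames holding the three string columns):

import numpy as np

# Compare how many distinct values each string column has in train vs. test.
for col in ['protocol_type', 'service', 'flag']:
    print(col,
          '-> train:', len(np.unique(X_train_obj[col])),
          'test:', len(np.unique(X_test_obj[col])))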
Can you help me understand and fix this?
1) Is this behavior expected when doing label encoding and one-hot encoding with the scikit-learn APIs?
2) How can I fix this and make sure the train and test sets end up with the same set of service types (or string categories in general)?
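One idea I had is to fit each encoder on the training data only and then just call transform on the test data, so both sides share the same category-to-column mapping. A minimal sketch of that, assuming a scikit-learn version (0.20+) whose OneHotEncoder accepts string columns directly, and using handle_unknown='ignore' so a category missing from the training data does not crash the transform:

from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(handle_unknown='ignore')

# Fit on TRAIN only: the training categories fix the column layout.
train_obj_1hot = hot_encoder.fit_transform(
    X_train_obj[['protocol_type', 'service', 'flag']])

# Transform (not fit_transform) TEST with the same fitted encoder, so it
# produces exactly the same columns; unseen categories become all-zero rows.
test_obj_1hot = hot_encoder.transform(
    X_test_obj[['protocol_type', 'service', 'flag']])

print(train_obj_1hot.shape, test_obj_1hot.shape)  # same column count

Is this the recommended way, or is there a better approach?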
I can add the full code to the question if required.