How to predict when the number of features in the test set does not match the number of features the model was trained on?

Question

I am using pandas get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features into the dataset. I then fit a model on this dataset.

Since X_train and X_test are split from the same encoded dataset, they have the same columns, so prediction on X_test works well.

Now let's say we have test data in another CSV file (with unknown outcomes). When we transform this data using get_dummies, the resulting dataset may not have the same number of features as the one the model was trained on. When we then use the model on this dataset, prediction fails because the number of features in the new test set does not match the model's.

Any idea how we can handle this?

Code :

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
X_train, X_test, y_train, y_test = train_test_split(
    features, outcomes, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train,y_train)

y_train_pred = model.predict(X_train)
#print (X_train.shape)
y_test_pred = model.predict(X_test)


from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

# Doing the same again for another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)

test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
print (model.predict(test_data3))

bamdan · Accepted Answer · 2018-08-29T09:17:43.013

A similar question seems to have been asked before, but the easiest and most efficient way is to follow the approach described by Thibault Clement:

# Get the columns present in the training set but missing from the test set
missing_cols = set(X_train.columns) - set(X_test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    X_test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
X_test = X_test[X_train.columns]

It's also worth noting that your model can only use the features it was trained on, so if X_test has additional columns compared with X_train rather than fewer, these have to be removed before predicting. Selecting X_test[X_train.columns], as in the last line above, drops them.
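Both cases (missing columns and extra columns) can also be handled in one step with DataFrame.reindex. The following is only a minimal sketch, not the method from the linked answer: the two small frames are made-up stand-ins for train.csv and test.csv, and the names train_raw, new_raw, train_features, new_features and aligned are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv and test.csv
train_raw = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})
new_raw = pd.DataFrame({
    "Sex": ["female", "female"],   # "male" never occurs here
    "Embarked": ["S", "X"],        # "X" never occurred in training
    "Age": [35.0, None],
})

# Encode both frames the same way as in the question
train_features = pd.get_dummies(train_raw).fillna(0.0)
new_features = pd.get_dummies(new_raw).fillna(0.0)

# Align the new data to the training columns:
# - columns missing from new_features are added and filled with 0
# - columns not seen in training are dropped
# - column order follows train_features
aligned = new_features.reindex(columns=train_features.columns, fill_value=0)

print(list(aligned.columns) == list(train_features.columns))
```

With the code from the question, this would amount to something like test_data3.reindex(columns=X_train.columns, fill_value=0) before calling model.predict, assuming X_train is still available (or its column list was saved alongside the model).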
