I am trying to apply both imputation and one-hot encoding to my data set. I know that imputation can change the dimensions of the data, so I took care of that manually. The model was working fine, but after I added one-hot encoding the program no longer runs. I am getting a dimension mismatch error.
test_X = pd.get_dummies(test)
train_X = pd.get_dummies(train)
col_with_missingVal = (col for col in train_X.columns if train_X[col].isnull().any())
for col in col_with_missingVal:
    train_X[col + 'is_missing'] = train_X[col].isnull()
    test_X[col + 'is_missing'] = test_X[col].isnull()
# Impute the data
imputer = Imputer()
imp_train_X = pd.DataFrame(imputer.fit_transform(train_X))
imp_test_X = pd.DataFrame(imputer.fit_transform(test_X))
imp_train_X.columns = train_X.columns
imp_test_X.columns = test_X.columns
# Fit the model
my_model = RandomForestRegressor()
my_model.fit(imp_train_X, train_y)
# Use the model to make predictions
predicted_prices = my_model.predict(imp_test_X)
I am getting the following error on the last line of code:
ValueError: Number of features of the model must match the input. Model n_features is 293 and input n_features is 274
What is the reason for this error and how can this be fixed?
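A minimal sketch of one likely cause, assuming the categorical columns in train and test do not contain exactly the same set of values: pd.get_dummies creates one column per value it sees, so calling it separately on each frame can produce different column sets. The toy frames and the 'size' and 'color' columns below are hypothetical and only illustrate the effect; reindexing the test frame to the training columns is one possible way to line them up, not necessarily the only one.

```python
import pandas as pd

# Hypothetical toy data: 'color' has a value in train ('blue') that test lacks.
train = pd.DataFrame({"size": [1.0, 2.0, 3.0], "color": ["red", "green", "blue"]})
test = pd.DataFrame({"size": [4.0, 5.0], "color": ["red", "green"]})

# Encoding each frame on its own yields different column sets.
train_X = pd.get_dummies(train)
test_X = pd.get_dummies(test)
print(train_X.shape[1], test_X.shape[1])

# Force test to have exactly the training columns, in the same order:
# columns missing from test are added and filled with 0, extras are dropped.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_X.columns) == list(train_X.columns))
```

With the columns aligned before the is_missing indicators and imputation are applied, both frames should reach the model with the same number of features. Separately, an imputer is normally fitted on the training data only and then applied to the test data with transform rather than a second fit_transform, so that both sets are filled with the same statistics.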