Ok I'm still having this issue and I'm at a loss as to where I'm going wrong. I thought I had a working solution, but I was wrong.
After finding a regression pipeline through TPOT, I call .predict(X_test) and get the following error message:
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
I read somewhere on GitHub that XGBoost prefers to have the X features passed to it as a NumPy array rather than a pandas DataFrame. So I did that, and now I receive this error message whenever a RandomForestRegressor ends up in my pipeline.
So I investigate:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed, shuffle=False)
# Here is where I convert the features to numpy arrays
X_train=X_train.values
X_test=X_test.values
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
[INFO] Printing the shapes of the training/testing feature/label sets...
(1366, 117)
(456, 117)
(1366,)
(456,)
# Notice 117 columns in both X_train and X_test...
# Now print the X_test shape just before the predict function...
print(X_test.shape)
(456, 117)
# Still 117 columns, so call predict:
predictions = best_model.predict(X_test)
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
WHY!!!!!!?????
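One way to narrow this down is to check where the column count changes inside the pipeline itself, since X_test clearly has 117 columns going in. The sketch below assumes the fitted model is (or wraps) a scikit-learn Pipeline. The names trace_feature_counts and add_constant_column are hypothetical, and the toy pipeline only stands in for whatever TPOT actually produced:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeRegressor


def trace_feature_counts(pipeline, X):
    """Return [(step_name, n_columns)] after each step before the final estimator.

    Hypothetical helper: it pushes X through every transformer step in turn,
    so a step that adds or drops a column becomes visible.
    """
    counts = [("input", X.shape[1])]
    Xt = X
    for name, step in pipeline.steps[:-1]:
        Xt = step.transform(Xt)
        counts.append((name, Xt.shape[1]))
    return counts


def add_constant_column(X):
    # Stand-in for any intermediate step that appends a synthetic feature.
    return np.hstack([X, np.ones((X.shape[0], 1))])


# Toy data with the same width as in the question (117 columns).
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 117)
y_demo = rng.rand(20)

demo_pipeline = Pipeline([
    ("append", FunctionTransformer(add_constant_column)),
    ("tree", DecisionTreeRegressor(random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

feature_counts = trace_feature_counts(demo_pipeline, X_demo)
print(feature_counts)  # [('input', 117), ('append', 118)]
```

With TPOT, the Pipeline to inspect would presumably be the fitted_pipeline_ attribute of the fitted TPOT object. If the count printed for the last transformer step differs from what the final regressor was trained on, the mismatch comes from inside the pipeline and not from X_test.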
Now the tricky thing is, I'm using a custom tpot_config that restricts TPOT to the regressors XGBRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, DecisionTreeRegressor, and RandomForestRegressor. So I need a way to train and predict where all of them handle the data in the same way, so that no matter what pipeline TPOT comes up with, I won't hit this issue each time I run my code!
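To make "the same way" concrete, here is a minimal sketch of one option: route both the training and the testing features through a single conversion function, so every regressor receives a plain 2-D NumPy array of a known width. The helper name as_feature_array and its expected_n_features parameter are hypothetical, not part of TPOT or scikit-learn:

```python
import numpy as np
import pandas as pd


def as_feature_array(X, expected_n_features=None):
    """Return X as a 2-D float NumPy array, optionally checking its width.

    Hypothetical helper: call it on X_train before fitting and on X_test
    before predicting, so both go through exactly the same conversion.
    """
    arr = X.to_numpy() if isinstance(X, pd.DataFrame) else np.asarray(X)
    arr = arr.astype(float)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D feature matrix, got shape {arr.shape}")
    if expected_n_features is not None and arr.shape[1] != expected_n_features:
        raise ValueError(
            f"expected {expected_n_features} features, got {arr.shape[1]}"
        )
    return arr


# Toy frames standing in for the real X_train / X_test.
train_frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
test_frame = pd.DataFrame({"a": [10], "b": [11], "c": [12]})

train_array = as_feature_array(train_frame)
test_array = as_feature_array(test_frame, expected_n_features=train_array.shape[1])
print(train_array.shape, test_array.shape)  # (3, 3) (1, 3)
```

This only guarantees that what goes into fit and predict has the same type and width. It would not catch a column being added by a step inside the pipeline.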
There have been similar questions asked at these links on SO:
... but I don't understand why my model is not predicting, when I AM passing it the same number of (X) features as was used in training the model!? Where am I going wrong here???
EDIT: I should also mention that leaving the features as DataFrames and not converting them to NumPy arrays sometimes gives me a "feature names mismatch" error when XGBRegressor is in the pipeline. So I'm at a loss as to how to handle both the tree regressors (which like DataFrames) and XGBoost (which likes NumPy arrays). I have also tried re-arranging the columns to make sure that the X_train and X_test DataFrames are in the same column order, as some have suggested, but that didn't change anything.
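For reference, the column re-arranging I mean looks roughly like the sketch below. The helper name align_columns is hypothetical, and it assumes both sets are pandas DataFrames with named columns:

```python
import pandas as pd


def align_columns(X_train, X_test):
    """Return X_test with its columns in the same order as X_train.

    Hypothetical helper: raises if the two frames do not contain exactly
    the same set of column names, since reordering cannot fix that.
    """
    missing = [c for c in X_train.columns if c not in X_test.columns]
    extra = [c for c in X_test.columns if c not in X_train.columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, extra columns: {extra}")
    return X_test[list(X_train.columns)]


# Toy frames: same columns, different order.
train_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
test_demo = pd.DataFrame({"c": [9], "a": [7], "b": [8]})

aligned_demo = align_columns(train_demo, test_demo)
print(list(aligned_demo.columns))  # ['a', 'b', 'c']
```

Since both sets come from the same train_test_split call on one X, their columns should already match, which may be why this step made no difference.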
I have posted my full code in a Google Colab notebook here where you can make comments on it. How can I pass the testing data to the .predict() function no matter what pipeline TPOT comes up with??????