
OK, I'm still having this issue and I'm at a loss as to where I'm going wrong. I thought I had a working solution, but I was wrong.

After finding a regression pipeline through TPOT, I call .predict(X_test) and get the following error:

    ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118

I read somewhere on GitHub that XGBoost likes to have the X features passed to it as a NumPy array rather than a pandas DataFrame. So I did that, and now I receive this error message whenever a RandomForestRegressor ends up in my pipeline.

So I investigate:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=test_size, random_state=seed, shuffle=False)

# Here is where I convert the features to NumPy arrays
X_train = X_train.values
X_test = X_test.values

print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

    [INFO] Printing the shapes of the training/testing feature/label sets...
    (1366, 117)
    (456, 117)
    (1366,)
    (456,)
# Notice 117 feature columns for X...

# Now print the X_test shape just before the predict function...
print(X_test.shape)

    (456, 117)
# Still 117 columns, so call predict:

predictions = best_model.predict(X_test)

    ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118

WHY!!!!!!?????

Now, the tricky thing is that I'm using a custom tpot_config limited to the regressors XGBRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, DecisionTreeRegressor, and RandomForestRegressor. So I need a way to train on and predict the features such that all of them handle the data in the same way; no matter what pipeline TPOT comes up with, I don't want to hit this error every time I run my code!
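For concreteness, here is a minimal sketch of what that custom config looks like. The estimator list is the one above; the hyperparameter grids and the generations/population_size settings are just illustrative assumptions, not my exact values:

import numpy as np
from tpot import TPOTRegressor

tpot_config = {
    'sklearn.tree.DecisionTreeRegressor':         {'max_depth': range(1, 11)},
    'sklearn.ensemble.RandomForestRegressor':     {'n_estimators': [100], 'max_features': np.arange(0.05, 1.01, 0.05)},
    'sklearn.ensemble.ExtraTreesRegressor':       {'n_estimators': [100], 'max_features': np.arange(0.05, 1.01, 0.05)},
    'sklearn.ensemble.GradientBoostingRegressor': {'n_estimators': [100], 'learning_rate': [1e-2, 1e-1, 0.5]},
    'sklearn.ensemble.AdaBoostRegressor':         {'n_estimators': [100], 'learning_rate': [1e-2, 1e-1, 0.5, 1.0]},
    'xgboost.XGBRegressor':                       {'n_estimators': [100], 'max_depth': range(1, 11), 'learning_rate': [1e-2, 1e-1, 0.5]},
}

tpot = TPOTRegressor(config_dict=tpot_config, generations=5,
                     population_size=50, random_state=seed, verbosity=2)
tpot.fit(X_train, Y_train)          # X_train is already a NumPy array from above
best_model = tpot.fitted_pipeline_  # a plain sklearn Pipeline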

There have been similar questions asked at these links on SO:

Here

Here

Here

Here

... but I don't understand why my model won't predict when I AM passing it the same number of (X) features that were used to train it. Where am I going wrong here?

EDIT: I should also mention that leaving the features as DataFrames, rather than converting them to NumPy arrays, sometimes gives me a "feature names mismatch" error when XGBRegressor is in the pipeline as well. So I'm at a loss as to how to handle both the list of tree regressors (which are happy with DataFrames) and XGBoost (which wants NumPy arrays). I have also tried re-ordering the columns so that the X_train and X_test DataFrames are in the same order, as some have suggested, but that didn't do anything.
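For reference, this is the kind of column re-ordering I tried, shown as a minimal sketch (done while X_train and X_test are still DataFrames, before the .values conversion):

# Reorder the test columns to match the training columns exactly, then
# convert both to NumPy arrays so no estimator ever sees feature names.
X_test = X_test[X_train.columns]
X_train = X_train.values
X_test = X_test.values

assert X_train.shape[1] == X_test.shape[1], 'train/test feature counts differ'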

I have posted my full code in a Google Colab notebook here, where you can leave comments on it. How can I pass the testing data to the .predict() function no matter what pipeline TPOT comes up with?


1 Answer


Thanks to weixuanfu at GitHub, I may have found a solution: moving the feature_importance code section down to the bottom of my code (after the predict call), and, yes, using NumPy arrays for the features. If I run into this issue again, I will post an update below:

https://github.com/EpistasisLab/tpot/issues/738
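In case it helps anyone, here is a sketch of the re-ordering described above. The variable names follow the question; the importance-printing loop itself is just illustrative:

# Predict FIRST, so nothing downstream can alter X_test before this call.
predictions = best_model.predict(X_test)

# THEN inspect feature importances on the already-fitted pipeline. This block
# only reads from the model and never modifies X_train or X_test.
final_estimator = best_model.steps[-1][1]   # last step of the sklearn Pipeline
if hasattr(final_estimator, 'feature_importances_'):
    for i, score in enumerate(final_estimator.feature_importances_):
        print('feature %d: %.4f' % (i, score))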
