
Consider three data sets: train/val/test. sklearn's GridSearchCV() by default chooses as the best model the one with the highest cross-validation score. In a real-world setting where the predictions need to be accurate, this is a horrible approach to choosing the best model. The reason is that the three sets are supposed to be used like this:

  • Train set - for the model to learn from the data.

  • Val set - to validate what the model has learned from the train set and to tune parameters/hyperparameters to maximize the validation score.

  • Test set - to test the model on unseen data.

  • Finally, use the model in a live setting and log the results to see if they are good enough to make decisions. It's surprising how many data scientists impulsively push their trained model to production based only on selecting the model with the highest validation score. I find that gridsearch chooses models that are painfully overfit and do a worse job of predicting unseen data than the default parameters. (A sketch of the three-way split follows this list.)
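To make the three-way split concrete, here is a minimal sketch using sklearn's train_test_split; the 60/20/20 proportions, the random_state, and the names X and y are assumptions, and for time-dependent data you would split by date rather than randomly.

from sklearn.model_selection import train_test_split

# Hypothetical 60/20/20 split; X and y are your feature matrix and target.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
# Split the remaining 40% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)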

My approaches:

  • Manually train the models and look at the results for each one (in a sort of loop, not very efficient). It's manual and time-consuming, but I get significantly better results than gridsearch. I want this to be completely automated.

  • Plot the validation curve for each hyperparameter I want to tune, and then pick the value that shows the smallest difference between the train and val scores while maximizing both (i.e. train=98%, val=78% is really bad, but train=72%, val=70% is acceptable). A rough sketch of this follows the list.
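Here is a rough sketch of that validation-curve approach using sklearn's validation_curve; the RandomForestClassifier, the max_depth parameter, and the candidate values are placeholders for whatever model and hyperparameter you are actually tuning.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

param_range = [2, 4, 8, 16, 32]  # hypothetical candidate values
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_train, y_train,
    param_name="max_depth", param_range=param_range,
    cv=3, scoring="accuracy")

# Average over the CV folds and inspect the train/val gap for each candidate.
train_mean, val_mean = train_scores.mean(axis=1), val_scores.mean(axis=1)
for p, tr, va in zip(param_range, train_mean, val_mean):
    print(f"max_depth={p}: train={tr:.3f}, val={va:.3f}, gap={tr - va:.3f}")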

As I said, I want a better (automated) method for choosing the best model.

The kind of answer I'm looking for:

I want to maximize the scores on both the train and validation sets while minimizing the score difference between them. Consider the following example from a gridsearch run with two models:

Model A: train score = 99%, val score = 89%
Model B: train score = 80%, val score = 79%

Model B is a much more reliable model, and I would choose Model B over Model A any day. It is less overfit, and its predictions are consistent; we know what to expect. However, gridsearch will choose Model A since its val score is higher. I find this to be a common problem and haven't found any solution anywhere on the internet. People tend to be so focused on what they learned in school that they don't actually think about the consequences of choosing an overfit model. I see redundant posts about how to use gridsearch from various packages and have it choose the model for you, but not about how to actually choose the best model.

My approach so far has been very manual. I want an automated way of doing this.

What I do currently is this:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# X_train and y_train effectively contain the validation data too if you do it
# this way, since GridSearchCV already carves out its own cv folds.
gs = GridSearchCV(model, params, cv=3).fit(X_train, y_train)
final_model = gs.best_estimator_
train_predictions = final_model.predict(X_train)
val_predictions = final_model.predict(X_val)
test_predictions = final_model.predict(X_test)

print('Train Score:', accuracy_score(y_train, train_predictions))  # .99
print('Val Score:', accuracy_score(y_val, val_predictions))        # .89
print('Test Score:', accuracy_score(y_test, test_predictions))     # .8

If I see something like the above, I'll rule out that model and try different hyperparameters until I get consistent results. By manually fitting different models and looking at all three of these scores, the validation curves, etc., I can decide which model is best. I don't want to do this manually; I want the process to be automated. The gridsearch algorithm returns overfit models every time. I look forward to hearing some answers.
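For concreteness, here is a minimal sketch of the kind of automated loop I have in mind, scored against an explicit validation set; the 0.05 gap tolerance is an arbitrary choice, and model and param_grid stand for whatever estimator and search space you are using.

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid

results = []
for params in ParameterGrid(param_grid):  # param_grid: candidate hyperparameters
    candidate = clone(model).set_params(**params).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, candidate.predict(X_train))
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    results.append((params, train_acc, val_acc, train_acc - val_acc))

# Keep only candidates with a small train/val gap, then take the best val score.
max_gap = 0.05  # arbitrary tolerance
consistent = [r for r in results if r[3] <= max_gap] or results
best_params = max(consistent, key=lambda r: r[2])[0]
print(best_params)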

Another big issue is the difference between the val and test sets. Since many problems have a time dependency, I'd like to know a reliable way to test a model's performance as time goes on. It's crucial to split the data set by time; otherwise we introduce data leakage. One method I'm familiar with is discriminative analysis (fitting a model to see whether it can predict which data set an example came from: train, val, or test). Another is running KS / KL tests on the distribution of the target variable, or looping through each feature and comparing its distribution across the sets.
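As a rough illustration of those checks (not a definitive recipe), here is a sketch that assumes pandas DataFrames X_train and X_test already split by time, an arbitrary 0.05 p-value cutoff, and sklearn's TimeSeriesSplit for time-ordered folds.

from scipy.stats import ks_2samp
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Per-feature Kolmogorov-Smirnov test between the train and test periods.
for col in X_train.columns:
    stat, p_value = ks_2samp(X_train[col], X_test[col])
    if p_value < 0.05:  # arbitrary cutoff
        print(f"{col}: distribution shift (KS={stat:.3f}, p={p_value:.3g})")

# Time-ordered CV folds, so validation data is always later than training data.
tscv = TimeSeriesSplit(n_splits=3)
gs = GridSearchCV(model, params, cv=tscv).fit(X_train, y_train)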

Matt Elgazar
  • I thought the point of having different validation and test sets is to _not_ choose your hyperparameters based on performance on your test set. – Axeman Oct 31 '19 at 17:01
  • Getting results like `Model A: train score = 99%, val score = 98%, test set = 80%` to me screams that the validation set isn't being generated correctly. Is there a time or space dependency? If so, then my guess would be that the validation is being done in time while the test set is out of time. As soon as you start using your test set to choose a model, you're opening yourself up to everything being overfit. – ClancyStats Oct 31 '19 at 17:10
  • You have a valid point, ClancyStats - however, what I have listed above in Model A is very common. I am not "choosing" the model based on the test set. I am choosing the model based on the 3 sets, then testing the model on a live feed, logging the data, and only after some time, when the model proves to be valid, will I use it. Choosing the model based on the validation set prevents you from maximizing the model's potential. What I mean by that is the algorithm uses the val set to update parameters, but you don't actually see how it does on unseen data. – Matt Elgazar Oct 31 '19 at 17:18
  • I'll even give the scenario where I take out the test set - it's the same problem. The gridsearch algorithm picks the model with the highest validation score. It doesn't look at the difference between the train score and val score. The difference should be close to 0. A train score of 99% and a val score of 88% is not a good model, but grid search will take that over train score of 88% and val score of 87%. I would choose the second model. – Matt Elgazar Oct 31 '19 at 17:23
  • If validation is being done correctly, the model with the higher validation score is likely to be the better model, regardless of the training score. Your first comment though, leads me to believe you have a possible time dependency (your test set is new data coming in from a live feed). You'd be better off manually setting up cross validation folds that use chunks of time rather than being randomly selected throughout. – ClancyStats Oct 31 '19 at 19:26
  • I see what you're saying, but I already split my data by time dependency. I disagree with your comment that the model with the highest validation score is likely the better model. Training score does matter in relation to the validation score. Regardless of how you split your data, even if you get a better model by splitting a certain way, that does not answer my question about how to return the best model. The best model will be the model that does equally well across ALL data sets. Consistent results are key. You can't just have a model that does 99% on train and val and only 50% on test. – Matt Elgazar Nov 01 '19 at 15:58

1 Answer


I agree with the comments that using the test set to choose hyperparameters obviates the need for the validation set (/folds), and makes the test set scores no longer representative of future performance. You fix that by "testing the model on a live feed," so that's fine.

I'll even give the scenario where I take out the test set - it's the same problem. The gridsearch algorithm picks the model with the highest validation score. It doesn't look at the difference between the train score and val score. The difference should be close to 0. A train score of 99% and a val score of 88% is not a good model, but grid search will take that over train score of 88% and val score of 87%. I would choose the second model.

Now this is something that's more understandable: there are reasons beyond raw performance to want the train/validation score gap to be small. See e.g. https://datascience.stackexchange.com/q/66350/55122. And sklearn actually does accommodate this, since v0.20, by setting return_train_score=True and passing refit as a callable that consumes cv_results_ and returns the index of the best candidate:

refit : bool, str, or callable, default=True

...

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

...

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Of course, that requires that you can distill your manual process of looking at scores and their differences into a function, and it probably doesn't accommodate anything like inspecting validation curves, but at least it's something.
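As a rough example (one of many possible rules, with an arbitrary 0.05 gap tolerance), such a refit callable might look like this:

import numpy as np
from sklearn.model_selection import GridSearchCV

def small_gap_best_index(cv_results):
    # Best mean validation score among candidates whose mean train/validation
    # gap is at most 0.05; fall back to all candidates if none qualify.
    train = np.asarray(cv_results["mean_train_score"])
    val = np.asarray(cv_results["mean_test_score"])
    ok = (train - val) <= 0.05
    if not ok.any():
        ok = np.ones_like(ok, dtype=bool)
    return int(np.argmax(np.where(ok, val, -np.inf)))

gs = GridSearchCV(model, params, cv=3, return_train_score=True,
                  refit=small_gap_best_index).fit(X_train, y_train)
final_model = gs.best_estimator_

As the docs note, best_score_ won't be set in that case, but best_estimator_ and best_params_ will be.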

Ben Reiniger