Consider three data sets: train, val, and test. By default, sklearn's GridSearchCV() picks the model with the highest cross-validation score. In a real-world setting where the predictions need to be accurate, this is a horrible approach to choosing the best model. The reason is that the three sets are supposed to be used like this:
Train set for the model to learn the dataset
Val set to validate what the model has learned in the train set and update parameters/hyperparameters to maximize the validation score.
Test set - to test the model on unseen data.
Finally, use the model in a live setting and log the results to see whether they are good enough to base decisions on. It's surprising how many data scientists impulsively put a trained model into production just because it had the highest validation score. I find that grid search chooses models that are badly overfit and that predict unseen data worse than the default parameters do.
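As a minimal sketch of the three-way split described above, assuming scikit-learn's train_test_split and a synthetic data set (the 60/20/20 proportions, the random seeds, and the variable names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the real features and labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First carve off 20% as the test set, which stays untouched during tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then split the remaining 80% into 75/25, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

A random split like this only makes sense when the rows are not time-dependent; with time-ordered data the split would have to follow the time axis instead.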
My approaches:
Manually train the models and look at the results for each model (in a sort of a loop, but not very efficient). It's very manual and time-consuming, but I get significantly better results than gridsearch. I want this to be completely automated.
Plot the validation curve for each hyperparameter I want to tune, and then pick the value that shows the smallest difference between the train and val scores while keeping both high (e.g. train=98%, val=78% is really bad, but train=72%, val=70% is acceptable).
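The validation-curve approach can be sketched roughly as follows, using scikit-learn's validation_curve on a single hyperparameter. The synthetic data, the decision tree, the max_depth grid, and the MAX_GAP tolerance of 0.05 are all assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

depths = [1, 2, 3, 5, 8, 12]  # illustrative grid for one hyperparameter
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy",
)

# Average over the CV folds: one train score and one val score per depth.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

MAX_GAP = 0.05  # assumed tolerance for the train/val difference
ok = gap <= MAX_GAP
if ok.any():
    # Highest val score among the settings whose gap is acceptable.
    best = int(np.argmax(np.where(ok, val_mean, -np.inf)))
else:
    # Nothing meets the tolerance: fall back to the smallest gap.
    best = int(np.argmin(gap))
chosen_depth = depths[best]

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d}: train={tr:.3f} val={va:.3f} gap={tr - va:.3f}")
print("chosen max_depth:", chosen_depth)
```

The same arrays can be passed to a plotting library to draw the curves; the selection step just replaces reading the plot by eye.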
As I said, I want a better (automated) method for choosing the best model.
What kind of answer I'm looking for:
I want to maximize the score in the train and validation set, while minimizing the score difference between the train and val sets. Consider the following example from a gridsearch algorithm: There are two models:
Model A: train score = 99%, val score = 89%
Model B: train score = 80%, val score = 79%
Model B is a much more reliable model, and I would choose Model B over Model A any day. It is less overfit, and the predictions are consistent. We know what to expect. However, gridsearch will choose model A since the val score is higher. I find this to be a common problem and haven't found any solution anywhere on the internet. People tend to be so focused on what they learn in school and don't actually think about the consequences of choosing an overfit model. I see redundant posts about how to use gridsearch from sklearn and caret packages and have them choose the model for you, but not how to actually choose the best model.
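For concreteness, the selection rule described above (maximize the val score, but only among models whose train/val gap is small) could be written as a small function. This is a sketch: the Candidate type, the select_candidate name, and the max_gap threshold of 0.05 are hypothetical choices, not part of any library:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    train_score: float
    val_score: float


def select_candidate(candidates, max_gap=0.05):
    """Pick the highest val score among candidates with train - val <= max_gap.

    max_gap is an assumed tolerance. If no candidate meets it, fall back
    to the candidate with the smallest gap.
    """
    within = [c for c in candidates if c.train_score - c.val_score <= max_gap]
    if within:
        return max(within, key=lambda c: c.val_score)
    return min(candidates, key=lambda c: c.train_score - c.val_score)


models = [
    Candidate("Model A", train_score=0.99, val_score=0.89),
    Candidate("Model B", train_score=0.80, val_score=0.79),
]
chosen = select_candidate(models)
print(chosen.name)  # Model B
```

A hard threshold is only one way to encode the trade-off; a penalty such as val minus some multiple of the gap would be another, and the right weighting depends on the problem.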
My approach so far has been very manual. I want an automated way of doing this.
What I do currently is this:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# X_train and y_train also supply the validation folds here, since GridSearchCV builds its own CV splits.
gs = GridSearchCV(model, params, cv=3).fit(X_train, y_train)
final_model = gs.best_estimator_  # refit on all of X_train with the best parameters
train_predictions = final_model.predict(X_train)
val_predictions = final_model.predict(X_val)
test_predictions = final_model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
print('Train Score:', accuracy_score(y_train, train_predictions)) # .99
print('Val Score:', accuracy_score(y_val, val_predictions)) # .89
print('Test Score:', accuracy_score(y_test, test_predictions)) # .8
If I see something like the above, I'll rule out that model and try different hyperparameters until I get consistent results. By manually fitting different models and looking at all 3 of these results, the validation curves, etc... I can decide what the best model is. I don't want to do this manually. I want this process to be automated. The gridsearch algorithm returns overfit models every time. I look forward to hearing some answers.
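One possible way to automate this check inside GridSearchCV itself is to request train scores with return_train_score=True and pass a callable as refit, which receives cv_results_ and returns the index of the parameter setting to refit. The sketch below assumes synthetic data, a decision tree, and a MAX_GAP tolerance of 0.05; the low_gap_best_index helper is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

MAX_GAP = 0.05  # assumed tolerance for mean train score minus mean val score


def low_gap_best_index(cv_results):
    """Return the index of the best-scoring setting whose gap is acceptable."""
    train = np.asarray(cv_results["mean_train_score"])
    val = np.asarray(cv_results["mean_test_score"])
    gap = train - val
    ok = gap <= MAX_GAP
    if ok.any():
        return int(np.argmax(np.where(ok, val, -np.inf)))
    # Nothing meets the tolerance: fall back to the smallest gap.
    return int(np.argmin(gap))


# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

params = {"max_depth": [1, 2, 3, 5, 8, None], "min_samples_leaf": [1, 5, 20]}
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    params,
    cv=3,
    return_train_score=True,   # needed so mean_train_score is recorded
    refit=low_gap_best_index,  # custom selection instead of highest val score
)
gs.fit(X, y)

i = gs.best_index_
print("chosen params:", gs.best_params_)
print("mean train score:", gs.cv_results_["mean_train_score"][i])
print("mean val score:", gs.cv_results_["mean_test_score"][i])
```

With a callable refit, best_estimator_ and best_params_ follow the returned index; as far as I know, best_score_ is not set in that case, so the scores have to be read from cv_results_ as shown.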
Another big issue is the difference between the val and test sets. Since many problems have a time dependency, I'd like to know a reliable way to test the model's performance as time goes on. It's crucial to split the data set by time; otherwise, we are introducing data leakage. One method I'm familiar with is discriminative analysis (fitting a model to see whether it can predict which data set each example came from: train, val, or test). Another is to run KS tests or compute the KL divergence on the distribution of the target variable, or to loop through each feature and compare its distribution across the sets.
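Both checks mentioned above can be sketched briefly, assuming SciPy's ks_2samp for the per-feature comparison and a logistic regression as the discriminating model. The two synthetic samples (X_old, X_new), the shift injected into one feature, and the choice of classifier are assumptions for illustration only:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Two synthetic samples standing in for an earlier and a later period (assumption).
X_old = rng.normal(0.0, 1.0, size=(500, 3))
X_new = rng.normal(0.0, 1.0, size=(500, 3))
X_new[:, 1] += 1.0  # inject a shift into feature 1 so there is drift to find

# Per-feature two-sample KS test: a small p-value suggests the distributions differ.
ks_pvalues = [
    ks_2samp(X_old[:, j], X_new[:, j]).pvalue for j in range(X_old.shape[1])
]
for j, p in enumerate(ks_pvalues):
    print(f"feature {j}: KS p-value = {p:.4g}")

# Discriminative check: can a model tell which sample a row came from?
X_all = np.vstack([X_old, X_new])
origin = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_new))])
auc = cross_val_score(
    LogisticRegression(), X_all, origin, cv=5, scoring="roc_auc"
).mean()
print(f"discriminator ROC AUC = {auc:.3f}")
```

An AUC near 0.5 would suggest the two samples are hard to tell apart, while a clearly higher value points to drift; how large a value is worth acting on is a judgment call for the specific problem.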