
I'm trying to understand how GridSearchCV's logic works. I looked here, at the official documentation, and at the source code, but I couldn't figure out the following:

What is the general logic behind GridSearchCV?

Clarifications:

  1. If I use the default cv = 5, what are the percentage splits of the input data into train, validation, and test?
  2. How often does GridSearchCV perform such a split, and how does it decide which observations belong to train / validation / test?
  3. Since cross-validation is being done, where does any averaging come into play for the hyperparameter tuning? I.e., is the optimal hyperparameter value one that optimizes some sort of average?

This question here shares my concern, but I don't know how up to date the information is, and I am not sure I understand all of it. For example, according to the OP, my understanding is that:

  • The test set is 25% of the input data set and is created once.
  • The union of the train set and validation set is correspondingly created once, and this union is 75% of the original data.
  • Then the procedure creates 5 (because cv = 5) further splits of this 75% into 60% train and 15% validation.
  • The optimal hyperparameter value is the one that optimizes the average of some metric over these 5 splits.

Is this understanding correct and still applicable now? And how does the procedure do the original 25%-75% split?
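For concreteness, here is the workflow I imagine this corresponds to in code (the estimator, parameter grid, and data below are placeholders I made up); what I can't tell is whether GridSearchCV performs the first 25%/75% split itself if I simply hand it all of my data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data
X, y = make_classification(n_samples=1000, random_state=0)

# Is this first split something I have to do myself, or does
# GridSearchCV create the 25% test set internally?
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# cv=5: my understanding is that the 75% is further split into
# 5 train/validation folds (60%/15% of the original data each time)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
```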

1 Answer

  1. First you split your data into train and test. The test set is left out for evaluating the model after training and hyperparameter optimization. GridSearchCV takes the remaining 75% of your data and splits it into 5 folds. It first trains on 4 folds and validates on the held-out fold, then rotates so that the previously held-out fold joins the training folds and a new fold is used for validation, and so on, 5 times in total.

Then the performance of each run can be inspected, along with their average, to understand how your model behaves overall.
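Here is a minimal sketch of that flow (the estimator, parameter grid, and data are just placeholders, not specific to your setup):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# You create the test set yourself; GridSearchCV never sees it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# cv=5: the training portion is cut into 5 folds; each candidate from
# param_grid is trained on 4 folds and validated on the remaining one, 5 times
param_grid = {"C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Per-fold validation scores and their average, for each candidate
for i in range(5):
    print(grid.cv_results_[f"split{i}_test_score"])
print(grid.cv_results_["mean_test_score"])
```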

  2. Since you are doing a grid search, the best parameter combination (best_params_) is saved at the end of the search so it can be used to predict on your test set.
  3. So to summarize: the best parameters are chosen after the whole search and the model is refit with them on the training data, therefore you can simply call predict(X_test) on the fitted object.
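Continuing the sketch above, those attributes are available directly on the fitted GridSearchCV object:

```python
# (continues the sketch above)
print(grid.best_params_)   # candidate with the best mean validation score
print(grid.best_score_)    # that best mean cross-validated score

# With the default refit=True, the best candidate is refit on all of
# X_train, so the grid object can predict on the untouched test set
y_pred = grid.predict(X_test)
print(grid.score(X_test, y_test))
```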

Read more here.

  4. Usually, if you don't perform CV, the model optimizes its weights with preset hyperparameters and the left-out test set helps assess model performance. However, for proper model training it is important to re-split the training data into train and validation, where you use the validation set to tune the hyperparameters of the model (manually). That said, over-tuning the model to get the best performance on the validation set is cheating.
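For that manual train/validation/test workflow (without GridSearchCV), a common pattern is two chained splits; the 60/20/20 proportions below are just an example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# First cut: hold out 20% as the final test set
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Second cut: carve a validation set out of the remaining 80%
# (0.25 of 80% = 20% of the original data, leaving 60% for training)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Tune hyperparameters by hand against (X_val, y_val);
# only touch (X_test, y_test) once, at the very end
```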

Theoretical K-Folds:

[figure: K-fold cross-validation diagram]

More details:

[figure]

  • So it seems my understanding of https://stackoverflow.com/q/48852524/6046501 is correct. A few follow-ups please: is the 25% / (5 times 60%+15%) split decided at random? And when does the procedure use the 25% test data? – yurnero Nov 26 '20 at 22:42
  • The 25% test data is not used and not given to the model, so you can test your model on it anytime you like... you can do it today, tomorrow, or next year. Regarding the 60%/15% split: theoretically, you divide your data in some way (let's say randomly) into 5 splits (ONLY ONCE) and you start iterating, training on 4 and validating on 1... – ombk Nov 26 '20 at 22:44
  • I see, so there is a misunderstanding on my part. I thought GridSearchCV is fed my entire data and then does its own partition into train + validation + test. You're saying I'm only meant to feed it what *I*, the user, decide to be train + validation. Correspondingly, relative to the input data, the split is actually 80% train + 20% validation. Is this correct? – yurnero Nov 26 '20 at 22:52
  • @yurnero 1) train, test = train_test_split(blabla), then feed the train part to the 5-fold CV and let it do the magic. Regarding your question: since you gave cv=5, then yes, 80%/20%. – ombk Nov 26 '20 at 22:54