What is "the left out data"? How is the data left out by `sklearn.model_selection.GridSearchCV`?

Question

The documentation of `sklearn.model_selection.GridSearchCV` says:

Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.

...

The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.

Lots of people on SO also use this term.

What is "the left out data"? Is it the left out part of cross-validation, for instance, 1/10 of the dataset?

How is the data left out by `sklearn.model_selection.GridSearchCV`?

Not entirely sure I'm correct on this, but if I recall correctly, the left-out data is the portion held out for cross-validation, and it is split from the total data randomly. Usually you use more like 25% of the data for cross-validation, though. — Mike, Aug 28 '19 at 23:10

score 0 · Accepted Answer · answered Aug 28 '19 at 23:15

From the documentation, `GridSearchCV` takes a parameter called cv:

cv : int, cross-validation generator or an iterable, optional

When cv is an integer, it sets the number of folds K in K-fold cross-validation. It also accepts a cross-validation generator or an iterable of train/test splits if you want a different splitting strategy.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
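To make that default concrete, here is a minimal sketch of what an integer cv amounts to for a classifier: an unshuffled StratifiedKFold. The toy arrays X and y are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption for illustration): 9 samples, 3 classes.
X = np.arange(18).reshape(9, 2)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# For a classifier, cv=3 corresponds to StratifiedKFold(n_splits=3)
# with the default shuffle=False.
splitter = StratifiedKFold(n_splits=3)

held_out = []
for fold, (train_idx, test_idx) in enumerate(splitter.split(X, y)):
    # test_idx is the part that is left out of training in this fold.
    held_out.append(test_idx)
    print(f"fold {fold}: train on {train_idx.tolist()}, left out {test_idx.tolist()}")
```

Every sample ends up in the left-out part exactly once across the three folds, and each left-out part keeps the class proportions of y.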

So, to answer your question: the grid search loops through the parameter grid and, for each parameter combination, runs cross-validation, for example 3-fold. Each round trains on two of the three folds and scores the model on the remaining fold, so 1/3 of the data is left out of training at each step. That held-out fold is "the left out data", and the scores on it, averaged over the folds, are what the search uses to pick the best parameters.
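As a rough illustration, the per-fold scores on the left-out data can be read from cv_results_ after fitting. The estimator (KNeighborsClassifier), the iris dataset, and the n_neighbors grid are arbitrary choices for this sketch, not something taken from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption): three candidate values of n_neighbors.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 15]},
    cv=3,  # 3 folds, so each fit leaves out 1/3 of the data for scoring
)
search.fit(X, y)

results = search.cv_results_
# One row per parameter combination, one column per left-out fold.
split_scores = np.column_stack(
    [results[f"split{i}_test_score"] for i in range(3)]
)
for params, scores in zip(results["params"], split_scores):
    print(params, "left-out scores:", scores.round(3), "mean:", scores.mean().round(3))

# The winner is the combination with the highest mean left-out score.
print("best params:", search.best_params_)
```

With the default refit=True, best_estimator_ is then refitted on the whole dataset using best_params_, so nothing is left out in that final fit.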
