
I would like to evaluate the performance of a model pipeline. I am not training my model on the ground-truth labels that I am evaluating the pipeline against, so a cross-validation scheme is unnecessary. However, I would still like to use the grid search functionality provided in sklearn.

Is it possible to use sklearn.model_selection.GridSearchCV without splitting the data? In other words, I would like to run Grid Search and get scores on the full dataset that I pass in to the pipeline.

Here is a simple example:

I might wish to choose the optimal k for KMeans. I am actually going to be using KMeans on many datasets that are similar in some sense. It so happens that I have ground-truth labels for a few such datasets, which I will call my "training" data. So, instead of using something like BIC, I decide to simply pick the optimal k for my training data and employ that k for future datasets. Searching over possible values of k is a grid search. KMeans is available in sklearn, so I can very easily define a grid search on this model. Incidentally, KMeans accepts a y value that it simply ignores, so ground-truth labels can pass through to a GridSearchCV scorer. However, there is no sense in doing cross-validation here, since my individual KMeans models never see the ground-truth labels and are therefore incapable of overfitting to them.
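To illustrate (a minimal sketch, with make_blobs standing in for one of my datasets and adjusted_rand_score as just one possible label-aware metric), the labels never touch the fitting step:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3).fit(X, y)        # y is accepted but ignored by KMeans
print(adjusted_rand_score(y, km.labels_))  # the labels only enter at evaluation time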

To be clear, the above is simply a contrived example to justify a possible use case, for those worried that I might abuse this functionality. What I am interested in is how to avoid splitting the data in GridSearchCV.


  • Doing `GridSearchCV` amounts to doing `CV`. Of course, from a technical standpoint you may do that without a `train/test` split, but that invalidates the `train/validate/test` philosophy commonly accepted in ML. – Sergey Bushmanov Feb 19 '20 at 16:28
  • @SergeyBushmanov the train/validate/test philosophy assumes that one is training the model on the ground-truth labels (the same labels that one might be testing against). My training pipeline does not use the ground-truth labels. Therefore, cross-validation does nothing, and overfitting to the ground-truth labels is impossible. – Him Feb 19 '20 at 16:55
  • "Of course from technical standpoint you may do that without train/test split." Of course it is. However, I am wondering if the existing gridsearch helpers in sklearn can aid in this? – Him Feb 19 '20 at 16:56
  • In theory you can do `gs=GridSearchCV(scoring=None, cv=None); gs.fit(X, None)`, but you should be more specific about what your problem is... – Sergey Bushmanov Feb 19 '20 at 17:02
  • `cv=None`: "None, to use the default 5-fold cross validation,". This does not turn off cross-validation. – Him Feb 19 '20 at 17:17

2 Answers


The docs state that the cv parameter of the GridSearchCV constructor can accept "An iterable yielding (train, test) splits as arrays of indices." It turns out that the "arrays of indices" part is not a hard requirement: any object that can be used to index an array works. If we hand in a single split whose train and test portions are both the whole array, we circumvent the cross-validation behavior.

Here is one way to accomplish that for the example given in the question:

import sklearn.cluster
import sklearn.model_selection

grid_search = sklearn.model_selection.GridSearchCV(
    sklearn.cluster.KMeans(),
    {"n_clusters": [2, 3, 4, 5, 7, 10, 20]},  # KMeans calls the number of clusters n_clusters
    cv=((slice(None), slice(None)),),  # one "split" whose train and test are both the full dataset
)

If you pass the ground-truth labels in as y, it will evaluate each run of KMeans (one per value of n_clusters) against the entire dataset.
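For example, a minimal usage sketch, with make_blobs standing in for a labelled dataset and adjusted_rand_score as one possible label-aware scorer (the default KMeans score ignores y, so an explicit scorer is needed if you want the labels to matter):

from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, make_scorer

X, y = make_blobs(n_samples=200, centers=4, random_state=0)  # stand-in for a real labelled dataset

grid_search.set_params(scoring=make_scorer(adjusted_rand_score))
grid_search.fit(X, y)            # the single "split" is the full dataset, so nothing is held out
print(grid_search.best_params_)  # the n_clusters value that scores best on the whole dataset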


You need to do cross-validation if you do grid search, otherwise you will overfit on the test data, because you evaluate several settings of hyper-parameters on the same data.

  • I reiterate that I am not employing the ground truth labels during training. It is impossible to overfit to data that I do not provide to the training algorithm. – Him Feb 19 '20 at 16:53
  • @Scott then I do not understand what you are doing. Could you clarify? Do you not use labels for training? Is it some kind of unsupervised learning? – BlackBear Feb 19 '20 at 17:01
  • Sort of. Note that grid search is simply an optimization method. Various optimizers that don't involve cross-validation are used for model selection all the time. For example, gradient descent methods optimize over parameters when the optimization surface is differentiable. My optimization requires grid search. The 'hyperparameters' that usually go into GridSearchCV are, in fact, simply my model parameters. – Him Feb 19 '20 at 17:04
  • I will try to contrive a simple example. – Him Feb 19 '20 at 17:04