I would like to evaluate the performance of a model pipeline. I am not training my model on the ground-truth labels that I am evaluating the pipeline against, so a cross-validation scheme is unnecessary. However, I would still like to use the grid search functionality provided in sklearn.
Is it possible to use sklearn.model_selection.GridSearchCV without splitting the data? In other words, I would like to run Grid Search and get scores on the full dataset that I pass into the pipeline.
Here is a simple example:
I might wish to choose the optimal k for KMeans. I am actually going to be using KMeans on many datasets that are similar in some sense. It so happens that I have some ground-truth labels for a few such datasets, which I will call my "training" data. So, instead of using something like BIC, I decide to simply pick the optimal k for my training data and employ that k for future datasets. Searching over possible values of k is a grid search, and since KMeans is available in sklearn, I can very easily define a grid search on this model. Incidentally, KMeans accepts an "empty" y value, which simply passes through and can be used by a GridSearchCV scorer. However, there is no sense in doing cross-validation here, since my individual KMeans models never see the ground-truth labels and are therefore incapable of overfitting to them.
To be clear, the above is just a contrived example, meant to justify a possible use case for anyone afraid that I might abuse this functionality. The solution to that example that I am interested in is simply how to stop GridSearchCV from splitting the data.
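For what it is worth, one direction I have been considering is handing GridSearchCV a single degenerate "split" through its cv parameter, which the docs say may be an iterable of (train, test) index arrays. A rough sketch, reusing X, y_true, and ari_scorer from the example above; I am not sure whether this is sound or idiomatic, which is essentially what I am asking:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

# X, y_true, and ari_scorer are reused from the sketch above.
# A single "split" whose train and test sets are both the full
# index range, i.e. effectively no split at all.
all_idx = np.arange(len(X))

search = GridSearchCV(
    KMeans(n_init=10),
    param_grid={"n_clusters": list(range(2, 11))},
    scoring=ari_scorer,
    cv=[(all_idx, all_idx)],
)
search.fit(X, y_true)  # each k is now fitted and scored on all of X
```

Is a degenerate cv like this considered supported usage, or is there a cleaner way to disable splitting entirely?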