
I see that in GridSearchCV the best parameters are determined by cross-validation, but what I really want is to determine the best parameters based on one held-out validation set instead of cross-validation.

I'm not sure if there is a way to do that. I found some similar posts about customizing the cross-validation folds, but again, what I really need is to train on one set and validate the parameters on a separate validation set.

One more piece of information: my dataset is a text Series created with pandas.

Dreamer
  • Have you tried looking into the cv parameter of the GridSearchCV class? It can take an iterable of the splits that you want. You can append your validation set to the training set and pass an iterable that gives the split into training and validation (see the sketch after these comments). – Abhinav Arora Jun 15 '16 at 01:06
  • Does this answer your question? [Using explicit (predefined) validation set for grid search with sklearn](https://stackoverflow.com/questions/31948879/using-explicit-predefined-validation-set-for-grid-search-with-sklearn) – Ben Reiniger May 30 '21 at 17:36
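
A minimal sketch of the approach from the first comment, assuming X_train, y_train, X_val, and y_val already exist as NumPy arrays; these names and the SVR estimator are placeholders, not part of the original question:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Stack training and validation data; GridSearchCV accepts an iterable
# of (train_indices, test_indices) pairs as its cv argument.
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))

# A single "fold": train on the first block, score on the second.
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X))

grid_search = GridSearchCV(SVR(), {'C': [1, 10, 100]}, cv=[(train_idx, val_idx)])
grid_search.fit(X, y)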

2 Answers


I came up with an answer to my own question through the use of PredefinedSplit:

import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 marks samples that always stay in the training set;
# 0 assigns samples to the single validation fold.
train_ind = np.full(len(doc_train), -1)
val_ind = np.zeros(len(doc_val))

ps = PredefinedSplit(test_fold=np.concatenate((train_ind, val_ind)))

and then pass it to GridSearchCV via the cv argument:

grid_search = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1, cv=ps)
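
For completeness, a sketch of how the pieces fit together. PredefinedSplit indexes into the combined data, so the fit must be on the training and validation sets stacked in the same order as the test_fold array; doc_train and doc_val come from the question, while y_train and y_val are assumed label arrays not shown in the original:

import pandas as pd

# Fit on train + validation stacked in the same order as test_fold.
X_all = pd.concat((doc_train, doc_val))
y_all = np.concatenate((y_train, y_val))

grid_search.fit(X_all, y_all)
print(grid_search.best_params_)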
Dreamer

Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can also be used with TensorFlow, PyTorch, Caffe2, etc.

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, and val sets and a model.
from sklearn.svm import SVR
from hypopt import GridSearch

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
cgnorthcutt
  • Do you then use the hyperparameters from the grid search to train your final model, e.g. opt.best_estimator_.fit(train, y), and use this as your final result? – Maths12 Feb 07 '20 at 17:36
  • @Maths12 it's already been trained. You can just predict directly using the best estimator. – cgnorthcutt Feb 08 '20 at 18:09
  • What is the difference between using sklearn's grid search and hypopt? I thought sklearn's GridSearchCV did hold out a validation set? – Maths12 May 21 '20 at 15:38
  • @Maths12 sklearn uses cross-validation, which is slow and trains on less data because it takes the validation set out of your dataset. It trains for every fold (4x training time for 5-fold CV). hypopt uses a predefined validation set that you already have. hypopt can also do cross-validation if you don't have a predefined validation set, and then it is no different from sklearn. But typically you use hypopt with a predefined validation set. – cgnorthcutt May 22 '20 at 16:49
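
As the comments above note, no refit is needed after opt.fit. A minimal usage sketch, assuming hypopt's GridSearch mirrors the sklearn estimator API (as the opt.score call in the answer suggests):

# Predict straight from the fitted GridSearch object; it already
# holds the model trained with the best parameters found.
test_preds = opt.predict(X_test)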