
I am trying to run a grid search where the model will be trained on my training set and tested only on a preset validation set (as requested by a manuscript reviewer).

I have split my data into train, validation, and test cohorts; I will train on the training cohort, tune hyperparameters with the validation cohort, and evaluate the final model on the test cohort. I recognize that GridSearchCV is ideal, but I need to perform the grid search without the cross-validation aspect.

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

ex_parameters_to_be_tuned = {
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'subsample': [0.25, 0.50, 0.75, 1]
}

model = lgb.LGBMRegressor(objective='regression', metric='rmse', boosting_type='gbdt')

# Need to change this so it does not cross-validate: it should train on the
# training data and evaluate on the validation data.

grid = GridSearchCV(estimator=model, param_grid=ex_parameters_to_be_tuned, scoring='neg_root_mean_squared_error')
grid.fit(X_valid, y_valid)

print('best score:', grid.best_score_)
print('best param:', grid.best_params_)

I would like it to be something like

grid.fit(X_train, y_train)
grid.test(X_valid, y_valid)

How can I do a grid search without CV using train and validation data only?

JVDeasyas123
  • Did you google your question? https://stackoverflow.com/questions/29503689/how-to-run-gridsearchcv-without-cross-validation – jhso Aug 10 '21 at 01:36
  • I did, but this does not seem to answer my question of how to fit my model on my training dataset and then test it on a prespecified validation dataset. – JVDeasyas123 Aug 10 '21 at 02:20
  • This one seems a bit better: https://stackoverflow.com/a/34625341/10475762. Just create a parameter grid and loop over it. It won't be a one-liner but should do the job. [see the sketch after these comments] – jhso Aug 10 '21 at 04:18
  • Isn't this a bad idea? You would then be using the same data twice. What is your motivation for this? Which publication is the review for? – jtlz2 Aug 10 '21 at 07:28
  • We will not use the same data twice. The validation cohort will only be used for hyperparameter optimization, and then a separate test cohort will be used for assessing the model. In essence this is conceptually the same as using the train dataset with CV for hyperparameter optimization and testing on a separate test dataset. Please explain if this is not correct in your understanding. – JVDeasyas123 Aug 10 '21 at 10:09
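
For reference, the "create a parameter grid and loop over it" approach suggested in the comments could look like the following. This is a minimal sketch, not a definitive implementation: it assumes the X_train/y_train and X_valid/y_valid arrays from the question already exist, and it selects the parameter combination with the lowest RMSE on the validation cohort.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error

param_grid = {
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'subsample': [0.25, 0.50, 0.75, 1]
}

best_rmse, best_params = float('inf'), None
for params in ParameterGrid(param_grid):
    # Train each candidate on the training cohort only.
    model = lgb.LGBMRegressor(objective='regression', boosting_type='gbdt', **params)
    model.fit(X_train, y_train)
    # Score on the preset validation cohort.
    rmse = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))
    if rmse < best_rmse:
        best_rmse, best_params = rmse, params

print('best rmse:', best_rmse)
print('best params:', best_params)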

1 Answer


The easiest way I can think of is to keep the train, validation (and test) cohorts in the same dataset (for example, X for the features and y for the labels) and to specify which indices of the data points are used for fitting the estimator and which for validating it.

For example, suppose you have the indices of your training samples in train_indices and the indices of your validation samples in validation_indices. Then you can pass them as a tuple wrapped in a list to the cv parameter of GridSearchCV.

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

ex_parameters_to_be_tuned = {
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'subsample': [0.25, 0.50, 0.75, 1]
}

model = lgb.LGBMRegressor(objective='regression', metric='rmse', boosting_type='gbdt')

# A single (train, validation) split instead of cross-validation folds:
# GridSearchCV accepts an iterable of (train_indices, test_indices) pairs as cv.
cv = [(train_indices, validation_indices)]

grid = GridSearchCV(estimator=model, param_grid=ex_parameters_to_be_tuned, cv=cv, scoring='neg_root_mean_squared_error')
grid.fit(X, y)

print('best score:', grid.best_score_)
print('best param:', grid.best_params_)

This will always train the estimator on the training samples and validate it on the validation samples, using the same fixed split for every parameter combination.
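
If the cohorts currently live in separate arrays, one way to build the combined X, y and the two index arrays is sketched below, assuming NumPy arrays named X_train/X_valid and y_train/y_valid as in the question:

import numpy as np

# Stack the training and validation cohorts into one dataset.
X = np.concatenate([X_train, X_valid])
y = np.concatenate([y_train, y_valid])

# The first len(X_train) rows are used for fitting, the remaining rows for scoring.
train_indices = np.arange(len(X_train))
validation_indices = np.arange(len(X_train), len(X))

cv = [(train_indices, validation_indices)]

With a single split and scoring='neg_root_mean_squared_error', grid.best_score_ is simply the negated RMSE of the best parameter combination on the validation cohort.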

afsharov