
I'm currently working on a problem that compares the performance of three different machine learning algorithms on the same dataset. I divided the dataset into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV with X_train and y_train.
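
Roughly what I'm doing at the moment (a minimal sketch; SVC and the parameter grid below are just placeholders for my actual algorithms, and X, y is the already-loaded dataset):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# 70/30 split of the full dataset (X, y assumed already loaded)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# grid search for each algorithm, using the training portion only
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid)
grid.fit(X_train, y_train)
```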

First question: am I supposed to perform the grid search on the training set, or should it be on the whole dataset?

Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I have performed cross-validation if I used the same X_train and y_train in GridSearchCV for all three algorithms I'm comparing?

Any answer would be appreciated, thank you.

kevinH

2 Answers

All estimators in scikit-learn whose names end with CV perform cross-validation. But you still need to keep a separate test set for measuring the final performance.

So you need to split your whole data into train and test sets. Set the test data aside for now.

Then pass only the train data to the grid search. GridSearchCV will split this train data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole train data with the best found parameters.

Now you need to test this model on the test data you kept aside at the beginning. This will give you a near real-world estimate of the model's performance.

If you pass the whole data to GridSearchCV, the test data leaks into the parameter tuning, and the final model may not perform as well on new, unseen data.
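
As a rough sketch of that workflow (the RandomForestClassifier and the grid here are only placeholders; X and y are your full data):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Split once and set the test part aside
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. Tune hyper-parameters on the train part only;
#    GridSearchCV makes its own internal train/validation splits (cv folds)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
grid.fit(X_train, y_train)  # refits the best model on all of X_train

# 3. Only now touch the held-out test set
y_pred = grid.predict(X_test)
print(grid.best_params_, accuracy_score(y_test, y_pred))
```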

You can look at my other answers, which describe GridSearchCV in more detail:

Vivek Kumar
  • GridSearchCV has a parameter, cv, in which you specify the number of folds of CV to use. Does this mean that, for example, when I have 10 possible hyperparameter combinations to test, GridSearchCV tests *all* of these combinations using 5-fold CV (so basically 10 x 5-fold CV)? – Psychotechnopath Jan 08 '20 at 13:16
  • @Psychotechnopath Yes. That will be printed when the grid search starts. You can get more details by using the `verbose` param in `GridSearchCV`. – Vivek Kumar Jan 08 '20 at 16:14
  • Say I want to use 2 folds, and I'm working with a time series. First, I need to split my dataset, for example with `tscv = TimeSeriesSplit()`. `TimeSeriesSplit()` already gives the train/test sets for those 2 folds (let's call them train1/test1 and train2/test2). Then, I can pass this parameter, `tscv`, to `GridSearchCV(..., cv=tscv, ...)`, and it will again split the train parts that I got from `TimeSeriesSplit` (train1 and train2) into "smaller" train/validation sets (smalltrain1/validation1 and smalltrain2/validation2) to train and evaluate my model? – Murilo Dec 11 '21 at 18:41
  • @MuriloAraujoSouza No, it will use the same splits that tscv gave. Or are you saying that first you divide the data into train and test and then pass only the train data into the grid search along with tscv? If yes, then yes, it will divide your original train dataset into smaller train and val datasets. – Vivek Kumar Dec 12 '21 at 04:06
  • I am using `tscv = TimeSeriesSplit(n_splits=2)`; after that I do `grid_search_RF = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_RF, cv=tscv)` and then I fit my model on my whole dataset with `grid_search_RF.fit(x, y)`. Not really sure if those are the correct steps. – Murilo Dec 12 '21 at 10:02
  • @MuriloAraujoSouza In this case, your dataset is split according to `tscv` inside the GridSearchCV. – Vivek Kumar Dec 13 '21 at 04:15
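
Putting together the split-first variant discussed in the last few comments (a minimal sketch, assuming `x`, `y`, and the `param_RF` grid from the comment above are already defined; the 80/20 cut-off is just an illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hold out the most recent observations as a final test set
# (plain array slicing; use .iloc for pandas DataFrames)
split_at = int(len(x) * 0.8)
x_train, x_test = x[:split_at], x[split_at:]
y_train, y_test = y[:split_at], y[split_at:]

# GridSearchCV re-splits only the train part, using time-ordered folds
tscv = TimeSeriesSplit(n_splits=2)
grid_search_RF = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_RF,  # param_RF: the grid from the comment, assumed defined
    cv=tscv,
)
grid_search_RF.fit(x_train, y_train)

# Only now evaluate on the untouched, most recent data
print(grid_search_RF.score(x_test, y_test))
```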

Yes, GridSearchCV performs cross-validation. If I understand the concept correctly, you want to keep part of your dataset unseen by the model so you can test on it.

So you train your models on the training dataset and test them on the testing dataset.

Here I was doing almost the same thing; you might want to check it...

MaxU - stand with Ukraine