
I am trying to understand how exactly GridSearchCV in scikit-learn implements the train-validation-test principle in machine learning. As far as I can tell, the code below does the following:

  1. split the 'dataset' into 75% and 25%, where the 75% is used for parameter tuning and the 25% is the held-out test set (line 1)
  2. initialise some parameters to search over (lines 3 to 6)
  3. fit the model on the 75% split, but divide that split into 5 folds, i.e. each time train on 60% of the full data and validate on the remaining 15%, and do this 5 times (lines 8-10). My first and second questions are about this step, see below.
  4. take the best-performing model and parameters and test them on the held-out data (lines 11-13)

Question 1: what exactly is going on in step 3 with respect to the parameter space? Does GridSearchCV try every parameter combination in each of the five runs (5-fold), giving a total of 10 runs? (i.e., the single values from 'optimizers', 'init' and 'batches' are paired with the 2 values from 'epochs')

Question 2: what score does the 'cross_val_score' line (line 10) print? Is it the average of the scores obtained on the single held-out fold in each of the 5 runs (i.e., the average over five folds that each cover 15% of the entire dataset)?

Question 3: suppose line 5 now has only 1 parameter value. In that case GridSearchCV is really not searching any parameters, because every parameter has only 1 value. Is this correct?

Question 4: in the case described in question 3, if we take a weighted average of the scores computed on the 5 folds of the GridSearchCV run and on the held-out run, we get an average performance score over the entire dataset. This looks very similar to a 6-fold cross-validation experiment (i.e., without grid search), except that the 6 folds are not all of equal size. Or is it not?

Many thanks in advance for any replies!

# imports needed for this snippet (KerasClassifier comes from the standalone
# Keras scikit-learn wrapper)
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier

# 'dataset' is assumed to be a numpy array loaded elsewhere; 'create_model'
# is the user-defined function that builds and compiles the Keras model.
X_train_data, X_test_data, y_train, y_test = \
         train_test_split(dataset[:, 0:8], dataset[:, 8],
                          test_size=0.25,
                          random_state=42)  # line 1

model = KerasClassifier(build_fn=create_model, verbose=0)
optimizers = ['adam']   # line 3
init = ['uniform']
epochs = [10, 20]       # line 5
batches = [5]           # line 6
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)  # line 8
grid_result = grid.fit(X_train_data, y_train)
cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5).mean()  # line 10
best_param_ann = grid.best_params_           # line 11
best_estimator = grid.best_estimator_
heldout_predictions = best_estimator.predict(X_test_data)   # line 13
Ziqi

1 Answer


Question 1: As you said, your dataset will be split into 5 pieces. Every parameter combination will be tried (in your case 2). For each combination, the model is trained on 4 of the 5 folds, and the remaining one is used as the validation fold. So you are right: in your example you are going to train a model 10 times.
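
If you want to see this for yourself, here is a minimal sketch (reusing the 'grid_result' object from your code; the pandas import is only for pretty-printing) that lists every candidate and its per-fold scores:

import pandas as pd

# cv_results_ has one row per candidate: 2 here (epochs=10 and epochs=20),
# each scored on 5 validation folds -> 10 fits in total.
results = pd.DataFrame(grid_result.cv_results_)
print(results[['params',
               'split0_test_score', 'split1_test_score', 'split2_test_score',
               'split3_test_score', 'split4_test_score',
               'mean_test_score']])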

Question 2: 'cross_val_score' gives the average score (accuracy, loss or whatever metric you choose) over the 5 validation folds. Averaging is done to avoid, for example, getting a good result just because one particular validation set happened to be easy.
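
To make that concrete, a small sketch reusing the names from your snippet ('grid', 'X_train_data', 'y_train'): without the .mean() call, cross_val_score returns one score per fold, and line 10 simply averages them.

from sklearn.model_selection import cross_val_score

# One score per validation fold (5 here), each computed on ~15% of the
# original data; .mean() is the single number produced by line 10.
scores = cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5)
print(scores)
print(scores.mean())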

Question 3: Yes. It makes no sense to do a grid search if you have only one set of parameters to try.
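
For illustration, a degenerate grid with a single value per parameter (the names 'param_grid_single' and 'grid_single' are just made up for this sketch) has exactly one candidate, so GridSearchCV simply performs a plain 5-fold cross-validation of that one setting:

from sklearn.model_selection import GridSearchCV

# 1 candidate x 5 folds = 5 fits; nothing is actually being searched.
param_grid_single = dict(optimizer=['adam'], epochs=[10],
                         batch_size=[5], init=['uniform'])
grid_single = GridSearchCV(estimator=model, param_grid=param_grid_single, cv=5)
grid_single.fit(X_train_data, y_train)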

Question 4: I didn't exactly understand your question. Usually you run the grid search on your training set only, which lets you keep the test set as a final validation set. Without cross-validation, you could find a setting that maximises results on your test set, and you would be overfitting the test set. With cross-validation, you can play as much as you want with fine-tuning parameters, because you never use the held-out set to choose them.

In your code there is no great need for CV, because you don't have many parameters to play with, but once you start adding regularization you may be trying 10+ combinations, and in that case CV is required.
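
As a rough sketch of what I mean (assuming, hypothetically, that your create_model accepted a dropout_rate argument; the extra values below are made up), the grid grows very quickly once you add regularization, and CV lets you explore it while the 25% test set stays untouched until the end:

from sklearn.model_selection import GridSearchCV

# 2 * 3 * 2 * 2 * 3 = 72 candidates, x 5 folds = 360 fits, all performed
# on the training split only.
param_grid_big = dict(optimizer=['adam', 'rmsprop'],
                      epochs=[10, 20, 50],
                      batch_size=[5, 10],
                      init=['uniform', 'normal'],
                      dropout_rate=[0.0, 0.2, 0.5])   # hypothetical parameter
grid_big = GridSearchCV(estimator=model, param_grid=param_grid_big, cv=5)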

I hope it helps,

Nicolas M.