I am trying to understand how exactly GridSearchCV in scikit-learn implements the train-validation-test principle in machine learning. As you can see in the code below, my understanding of what it does is as follows:
- split 'dataset' into 75% and 25%, where the 75% is used for parameter tuning and the 25% is the held-out test set (line 1)
- initialize some parameter values to search over (lines 3 to 6)
- fit the model on the 75% portion, but split that portion into 5 folds, i.e., each time train on 60% of the full data and validate on the remaining 15%, and do this 5 times (lines 8-10); see the quick arithmetic check after this list. My first and second questions are about this step, see below.
- take the best-performing model and parameters and test them on the held-out data (lines 11-13)
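As a quick arithmetic check on the fold sizes in that third step (the fractions below are just my own back-of-the-envelope numbers, assuming the 75%/25% split and cv=5):

train_fraction = 0.75                      # portion of 'dataset' handed to GridSearchCV
per_fold = train_fraction / 5              # each CV fold is 0.15 of the full dataset
fit_fraction = train_fraction - per_fold   # each fit trains on 4 folds = 0.60 of the full dataset
print(per_fold, fit_fraction)              # -> 0.15 and 0.6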
Question 1: what exactly is going on in step 3 with respect to the parameter space? Is GridSearchCV trying every parameter combination in every one of the five runs (5-fold), giving a total of 10 runs? (i.e., the single value from 'optimizers', 'init', and 'batches' is paired with the 2 values from 'epochs')
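To make Question 1 concrete, this is how I picture the candidate settings being enumerated (a small check using sklearn's ParameterGrid; 'candidates' is just my own variable name):

from sklearn.model_selection import ParameterGrid

param_grid = dict(optimizer=['adam'], epochs=[10, 20], batch_size=[5], init=['uniform'])
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 2 candidate settings, since only 'epochs' has two values
# With cv=5, my understanding is that GridSearchCV fits each candidate on each fold: 2 * 5 = 10 fits.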
Question 2: what score does the 'cross_val_score' line (line 10) print? Is it the average of the 10 runs above, taken on the single validation fold of each of the 5 splits (i.e., the average over five chunks of 15% of the entire dataset)?
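My current reading of that line is sketched below, assuming cross_val_score returns one score per fold ('scores' is just my own variable name):

scores = cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5)
print(scores)         # an array of 5 scores, one per 15% validation fold
print(scores.mean())  # the single number from line 10: the unweighted mean of those 5 scores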
Question 3: suppose line 5 now also has only 1 parameter value; in that case GridSearchCV is really not searching over any parameters, because each parameter has only 1 value. Is this correct?
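In other words, the degenerate case I mean in Question 3 would look like this (only the grid changes; everything else stays as in the code below):

# every parameter now has exactly one value, so there is only one candidate setting
param_grid_single = dict(optimizer=['adam'], epochs=[10], batch_size=[5], init=['uniform'])
grid_single = GridSearchCV(estimator=model, param_grid=param_grid_single, cv=5)
grid_single.fit(X_train_data, y_train)  # my understanding: plain 5-fold CV of that one setting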
Question 4: in the case explained in Question 3, if we take a weighted average of the scores computed on the 5 folds of the GridSearchCV run and the held-out run, that gives us an average performance score over the entire dataset. This looks very similar to a 6-fold cross-validation experiment (i.e., without grid search), except that the 6 folds are not all of equal size. Or is it not?
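The weighted average I have in mind in Question 4 would be computed roughly like this ('fold_scores', 'heldout_score', 'weights' and 'overall' are hypothetical names; I am assuming the wrapper's score() method gives accuracy on the 25% test set):

import numpy as np

fold_scores = cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5)  # five 15% folds
heldout_score = best_estimator.score(X_test_data, y_test)                         # one 25% chunk
weights = np.array([0.15] * 5 + [0.25])           # fraction of the full dataset behind each score
overall = np.average(np.append(fold_scores, heldout_score), weights=weights)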
Many thanks in advance for any replies!
# create_model() builds and compiles the Keras network; 'dataset' is a NumPy array loaded earlier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier

X_train_data, X_test_data, y_train, y_test = \
    train_test_split(dataset[:, 0:8], dataset[:, 8],
                     test_size=0.25,
                     random_state=42)  # line 1: 75% for tuning, 25% held out

model = KerasClassifier(build_fn=create_model, verbose=0)

optimizers = ['adam']  # line 3
init = ['uniform']
epochs = [10, 20]  # line 5
batches = [5]  # line 6

param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)  # line 8
grid_result = grid.fit(X_train_data, y_train)

cross_val_score(grid.best_estimator_, X_train_data, y_train, cv=5).mean()  # line 10

best_param_ann = grid.best_params_  # line 11
best_estimator = grid.best_estimator_
heldout_predictions = best_estimator.predict(X_test_data)  # line 13