
I'm trying to wrap my head around the example of nested vs. non-nested CV in Sklearn. I checked multiple answers, but I am still confused by the example. To my knowledge, nested CV aims to use different subsets of the data to select the best parameters of a classifier (e.g. C in SVM) and to validate its performance. Therefore, from a dataset X, the outer 10-fold CV (for simplicity, n=10) creates 10 training sets and 10 test sets:

(Tr0, Te0), ..., (Tr9, Te9)

Then, the inner 10-fold CV splits EACH outer training set into 10 further (training, test) pairs:

From Tr0: (Tr0_0, Te0_0), ..., (Tr0_9, Te0_9)
...
From Tr9: (Tr9_0, Te9_0), ..., (Tr9_9, Te9_9)

Now, using the inner CV, we can find the best value of C for every single outer training set. This is done by testing all the candidate values of C with the inner CV; the value providing the highest performance (e.g. accuracy) is chosen for that specific outer training set. Finally, having discovered the best C value for every outer training set, we can calculate an unbiased accuracy using the outer test sets. With this procedure, the samples used to identify the best parameter (i.e. C) are never used to compute the performance of the classifier, hence we get a truly unbiased validation.
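In case it helps, the procedure I just described can be written out by hand as two explicit loops (a minimal sketch; the grid of C values and the use of SVC on the iris data are just illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
C_grid = [0.1, 1, 10]  # illustrative candidate values

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):  # outer split: (Tr_k, Te_k)
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Inner CV: pick the best C using ONLY the outer training set Tr_k
    inner_cv = KFold(n_splits=10, shuffle=True, random_state=0)
    best_C, best_score = None, -np.inf
    for C in C_grid:
        score = cross_val_score(SVC(C=C), X_tr, y_tr, cv=inner_cv).mean()
        if score > best_score:
            best_C, best_score = C, score

    # Evaluate the chosen C on the untouched outer test set Te_k
    model = SVC(C=best_C).fit(X_tr, y_tr)
    outer_scores.append(model.score(X_te, y_te))

print(np.mean(outer_scores))  # unbiased estimate of generalization accuracy
```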

The example provided in the Sklearn page is:

# Setup omitted from the snippet (taken from the full Sklearn example)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
svm = SVC(kernel="rbf")
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
non_nested_scores, nested_scores = np.zeros(30), np.zeros(30)
i = 0  # trial index; the full example loops i over several random seeds

inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()

From what I understand, the code simply calculates the scores using two different cross-validations (i.e. different splits into training and test sets), both of them using the entire dataset. GridSearchCV identifies the best parameters using one of the two CVs; then cross_val_score calculates, with the second CV, the performance when using the best parameters.

Am I interpreting a Nested CV in the wrong way? What am I missing from the example?

asked by NCL, edited by Julia Meshcheryakova
    You can take a look at [my answer here](https://stackoverflow.com/a/42230764/3374996) to get a step by step analysis. – Vivek Kumar Oct 06 '17 at 10:35
  • I got really confused by the names and the order, as I expected outer_cv to be used "before" inner_cv. So, the nesting occurs because we pass clf, which is an instance of GridSearchCV, to cross_val_score? Hence, in simple words, cross_val_score first splits X into X_tr, X_te; then X_tr is passed to clf, which, because it is an instance of GridSearchCV, will further split X_tr into X_tr_tr and X_tr_te? – NCL Oct 06 '17 at 11:22
  • Yes, you are correct. One `X_tr` is split into `X_tr_tr` and `X_tr_te` for the number of folds defined in `inner_cv`. Then, according to the `outer_cv`, some other part of the data becomes `X_tr`, which is then again sent to `inner_cv`. Hope it makes sense. – Vivek Kumar Oct 06 '17 at 11:29
  • Yes, thanks Vivek. So, if we directly pass clf = SVM() to cross_val_score we obtain a "traditional" cross-fold validation. – NCL Oct 06 '17 at 13:21
    Yes, the whole nested cross-validation takes place because of cross-validation done in the GridSearchCV. If using simple estimators, this becomes the simple cross-validation – Vivek Kumar Oct 06 '17 at 13:24
  • Does this answer your question? [scikit-learn GridSearchCV with multiple repetitions](https://stackoverflow.com/questions/42228735/scikit-learn-gridsearchcv-with-multiple-repetitions) – adrin Oct 25 '21 at 08:45
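The distinction worked out in the comments above can be seen directly by passing either a plain estimator or a GridSearchCV instance to cross_val_score (a sketch; the grid values and the use of SVC on iris are illustrative assumptions, not part of the original snippet):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# "Traditional" CV: outer_cv splits X; no parameter search inside each fold.
plain = cross_val_score(SVC(), X, y, cv=outer_cv)

# Nested CV: each outer training fold is handed to GridSearchCV, which runs
# its own inner_cv splits on that fold to pick C before the fold is scored.
clf = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
nested = cross_val_score(clf, X, y, cv=outer_cv)

print(plain.mean(), nested.mean())
```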
