
I was reading the SkLearn docs on nested cross-validation and discovered the following example of nested cross-validation on this SkLearn page:

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
import numpy as np

print(__doc__)

# Number of random trials
NUM_TRIALS = 30

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

# Loop for each trial
for i in range(NUM_TRIALS):

    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g. "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non_nested parameter search and scoring
    clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

score_difference = non_nested_scores - nested_scores

print("Average difference of {0:.6f} with std. dev. of {1:.6f}."
      .format(score_difference.mean(), score_difference.std()))

# Plot scores on each trial for nested and non-nested CV
plt.figure()
plt.subplot(211)
non_nested_scores_line, = plt.plot(non_nested_scores, color='r')
nested_line, = plt.plot(nested_scores, color='b')
plt.ylabel("score", fontsize="14")
plt.legend([non_nested_scores_line, nested_line],
           ["Non-Nested CV", "Nested CV"],
           bbox_to_anchor=(0, .4, .5, 0))
plt.title("Non-Nested and Nested Cross Validation on Iris Dataset",
          x=.5, y=1.1, fontsize="15")

# Plot bar chart of the difference.
plt.subplot(212)
difference_plot = plt.bar(range(NUM_TRIALS), score_difference)
plt.xlabel("Individual Trial #")
plt.legend([difference_plot],
           ["Non-Nested CV - Nested CV Score"],
           bbox_to_anchor=(0, 1, .8, 0))
plt.ylabel("score difference", fontsize="14")

plt.show()

I do not know if I am missing something, but is this really an example of nested cross-validation?

In my view, the problem is that in this example both the parameter optimisation and the model evaluation are done with the same dataset, whereas properly the former should be done with the training & validation sets and the latter with the test set.

Specifically, in this example, both the "inner" and "outer" loops use the whole of X_iris and y_iris, since:

  • clf.fit(X_iris, y_iris) (Grid Search)
  • cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
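For clarity, here is a minimal sketch (my own illustration, not from the SkLearn page, using the same Iris setup but a fixed `random_state=0`) of what a properly nested procedure looks like when the outer loop is written out explicitly. This is what I would expect `cross_val_score(clf, ...)` to have to do internally: clone `clf` and refit it on each outer training fold only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # The grid search (inner CV) sees only the outer training fold...
    fold_clf = clone(clf).fit(X[train_idx], y[train_idx])
    # ...and evaluation uses only the held-out outer test fold.
    outer_scores.append(fold_clf.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))
```

Here no outer test fold is ever seen by the hyperparameter search, which is the separation I am asking about.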
    Your question is unclear; the existence of any test set is independent of & external to the CV procedure which, generally speaking, involves the creation of *training* and *validation* folds (but not *test* ones) from the initial data... – desertnaut Sep 12 '18 at 13:18
  • Yes, exactly, but in the SkLearn example it is written `clf.fit(X_iris, y_iris)` and `cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)`; that is, both the hyperparameter optimization and the model evaluation are done with exactly the same `X_iris` and `y_iris`. – Outcast Sep 12 '18 at 13:22
  • Any ideas @desertnaut? Where have you gone suddenly? – Outcast Sep 12 '18 at 14:00
  • Sorry no, busy with stuff... But I concur that the terminology used by sklearn here ("nested", "inner", "outer") is indeed puzzling & not justifiable by their code example (so, +1 for your question)... – desertnaut Sep 12 '18 at 14:08
  • No worries... Sometimes I wonder how other data scientists did not spot things like this for such a long time...haha... – Outcast Sep 12 '18 at 14:11
  • Sometimes newcomers can spot such things much easier (have never looked at this myself); but BTW, in other cases they can get unnecessarily confused by over-reading things (re your other Q on `LabelEncoder`)... – desertnaut Sep 12 '18 at 14:13
  • Haha, ok. However, believe me that I did not over-read anything regarding LabelEncoder(). I just listened to a senior Data Scientist saying something which at best was not clear enough. Anyways, thanks for the comments so far! – Outcast Sep 12 '18 at 14:16
  • 1
    See my answer on the description of above here:https://stackoverflow.com/a/42230764/3374996 – Vivek Kumar Sep 12 '18 at 15:49
  • Thanks @VivekKumar. However, I do not understand how your answer there responds to my question. Are you simply saying there that what is called nested cross validation in the SkLearn's docs is essentially a twice repeated cross validation? If so then this was exactly my point. – Outcast Sep 12 '18 at 17:20
  • Check this out. I tried to explain there about nested CV - https://stackoverflow.com/questions/52138897/fitting-in-nested-cross-validation-with-cross-val-score-with-pipeline-and-gridse/52147410#52147410 – ShaharA Sep 17 '18 at 14:42
  • Nice one @ShaharA (upvote) but I had in mind what is nested cross validation when I was writing my post. My point was that what SkLearn calls nested cross validation is not a real one. – Outcast Sep 17 '18 at 14:49
  • @Poete - you're totally right. they do not use the current fold of the outer CV in the inner CV in their example, but the entire dataset, which is wrong and misleading – ShaharA Sep 17 '18 at 14:53
  • @ShaharA ;) ... – Outcast Sep 17 '18 at 17:29
  • 3
    @ShaharA, @PoeteMaudit, I know I'm late to the party, but sklearn's nested example is actually fine. The most confusing bit is their use of `clf.fit(X_iris, y_iris)` before sending it into `cross_val_score`. However, `cross_val_score` will **refit** `clf` (see how in their example they provide an unfitted classifier: [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)) - this time, when fitting inside `cross_val_score`, it will be fitted using the correct folds. I hope that clears it. – ehudk Mar 09 '19 at 12:34
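To make the refitting point concrete, here is a small check (my own sketch, not from the thread): `cross_val_score` clones the estimator before fitting it on each outer training fold, so any prior fit on the full dataset is simply discarded and the estimator passed in is left untouched.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svc = SVC(kernel="rbf", C=1, gamma=0.1)
svc.fit(X, y)  # fit on the full dataset, as in clf.fit(X_iris, y_iris)
support_before = svc.support_.copy()

scores = cross_val_score(svc, X, y, cv=KFold(n_splits=4, shuffle=True, random_state=0))

# cross_val_score worked on clones refitted per outer fold;
# the fitted state of the estimator we passed in is unchanged.
print(np.array_equal(svc.support_, support_before))  # True
```

The same applies when the estimator is a `GridSearchCV`: each clone reruns the grid search from scratch on its outer training fold, which is exactly the nested behaviour the docs claim.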

0 Answers