21

I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from a previous test I noticed that the training/test split highly influences the overall performance (r2 in this instance). To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it with GridSearchCV?

Quick solution, following the idea presented in the official scikit-learn documentation:

import numpy
from sklearn.model_selection import GridSearchCV, KFold

# svr, p_grid, X and y are assumed to be defined elsewhere
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ is only available after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
Julia Meshcheryakova
Titus Pullo
  • To understand better, your goal would be to repeat 5CV in order to see how SVR behaves? Which means you will be using 10x5 different splits for each parameter combination? In any case, you can provide a custom cv function that does that and yields a dataset split as many times as you want or customize it however you need it. GridSearchCV will consider it as a run with the selected parameters each time and it will gather the results at the end as usual. – mkaran Feb 15 '17 at 11:38
  • @Titus Pullo, please accept the answer if any one of them has helped you. – learnToCode Oct 22 '21 at 07:13

2 Answers

35

This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.

You can adapt these steps to suit your needs:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Iris data, as in the linked scikit-learn example
X_iris, y_iris = load_iris(return_X_y=True)

svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ...]}

# Other CV techniques can be used here: GroupKFold, LeaveOneOut, LeaveOneGroupOut, etc.

i = 0  # random seed (the linked example loops over several trial seeds)

# To be used within GridSearchCV (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()

Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()

  1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
  2. Pass clf, X, y, outer_cv to cross_val_score.
  3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
  4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
  5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
  6. Now the GridSearchCV estimator will be trained using X_inner_train and y_inner_train, and scored using X_inner_test and y_inner_test.
  7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
  8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is fitted on all of that data, i.e. X_outer_train.
  9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
  10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here) and an array of scores will be returned from cross_val_score.
  11. We then use mean() to get back nested_score (a loop-level sketch of these steps follows below).
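
If it helps to see those steps as code, here is a minimal, illustrative sketch of the loops that cross_val_score and GridSearchCV run internally. It assumes X and y are numpy arrays and that svr, c_grid, inner_cv and outer_cv are defined as in the snippet above; the single cross_val_score line above does the same thing, so the explicit loop is for understanding only.

# Illustrative only: a hand-rolled version of what
# cross_val_score(GridSearchCV(...), cv=outer_cv) does internally.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):                # steps 3-4
    X_outer_train, X_outer_test = X[train_idx], X[test_idx]
    y_outer_train, y_outer_test = y[train_idx], y[test_idx]

    gs = GridSearchCV(estimator=clone(svr), param_grid=c_grid, cv=inner_cv)
    gs.fit(X_outer_train, y_outer_train)                        # steps 5-8: inner CV + refit

    outer_scores.append(gs.score(X_outer_test, y_outer_test))   # step 9

nested_score = np.mean(outer_scores)                            # steps 10-11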
Meng Zhang
Vivek Kumar
  • 1
    I don't want a nested CV, I simply want to repeat the CV 10 times, each time using a different split of the data into training and test set. – Titus Pullo Feb 15 '17 at 09:41
  • As far as I understand, this is what the `outer_cv` is doing. It will split the data into training and test 10 times (`n_split`) and `cross_val_score` will score it against the grid_search (`clf`) which in turn will split the data passed into it (i.e. the training data from `outer_cv`) again into train and test to find best params. – Vivek Kumar Feb 15 '17 at 11:00
  • Can you give an example of what you want to do actually? – Vivek Kumar Feb 15 '17 at 11:00
  • For a fixed set of parameters I'd like to obtain 10 AUC values, calculated using 10 different 5CVs, in order to check how different splits into training and test sets affect the AUC values. – Titus Pullo Feb 15 '17 at 12:43
  • So in those 10 iterations, do you want to split the data into train and test, and then send the train data into 5CV, which will again split the train data into its own train and test, or do you just want to send all the data to 5CV to get the score 10 times? If you want the latter, then just iterate in a loop and set a different `random_state` in the 5CV iterator. – Vivek Kumar Feb 16 '17 at 06:33
  • Yes, I updated the question adding this simple solution, I was hoping for a built-in solution. – Titus Pullo Feb 16 '17 at 11:19
  • I just realised that your solution, and in general the "so called" nested cross-validation, is simply a cross-validation repeated twice, where each time the SAME data are split into training and test sets in a different way, correct? – Titus Pullo Feb 17 '17 at 11:24
  • @Lazza87 I have updated my answer to be clearer about nested cross-validation – Vivek Kumar Feb 17 '17 at 12:11
  • @VivekKumar can you explain clf.fit(X_iris, y_iris)? It comes between your steps 1 and 2? What happens here? Do you mean that this statement is executed later on? – utengr Mar 02 '17 at 14:18
  • I see that you have deleted your question. No, it doesn't come between steps 1 and 2. I have edited the answer for better understanding. – Vivek Kumar Mar 02 '17 at 16:18
  • You are a life saver, finally a complete, concrete example in one answer, simple and with enough explanation.. plus one! – Mike Aug 07 '18 at 14:31
  • I cannot see why you need the extra `clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)` at the end before `cross_val_score`?! – Mike Aug 07 '18 at 15:58
  • @Mike. Those are two different code snippets. The upper one is for non-nested CV and the one below is for nested CV. – Vivek Kumar Aug 08 '18 at 12:24
  • No, you need to use the first estimator `clf.best_estimator_` in `cross_val_score(HERE, X=X_iris, y=y_iris, cv=outer_cv).mean()` instead of creating a new one.. otherwise there is no point in creating the first one if you don't wanna use its result! – Yahya Aug 09 '18 at 11:48
  • @Yahya. `clf` by default will use `clf.best_estimator_` internally when `clf.predict()` is called, along with parameter tuning on training data of each fold. – Vivek Kumar Aug 09 '18 at 11:51
  • How is that, when you re-created/re-initialized `clf` before the last step?! – Yahya Aug 09 '18 at 11:59
  • @Yahya Yes, I know. The steps above describe that case only. Please read them carefully. – Vivek Kumar Aug 09 '18 at 12:00
  • Do you mean the `clf.best_estimator_` will be assigned to `svr`? – Yahya Aug 09 '18 at 12:12
  • @Yahya, `clf.best_estimator_` is an svr which is fitted on `X_train` as described in step 8. How its tuned is mentioned in steps before that. – Vivek Kumar Aug 09 '18 at 12:26
  • In step 8, I think `X_train` should be `X_outer_train` – Meng Zhang Aug 11 '18 at 13:20
  • @MingLi Yes. Thanks – Vivek Kumar Aug 12 '18 at 04:59
  • 2
    @VivekKumar Thanks a lot for the detailed explanation. You took the example from [scikit-learn](http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html) - so it seems to be a common approach. An aspect I don't get with nested cross-validation is why the outer CV triggers the grid-search `n_splits=10` times. I would expect the outer CV to test only the best model (with fixed params) with 10 different splits. Here, the outer CV compares 10 different models (possibly with 10 different sets of params), which I consider a bit problematic. – normanius Aug 27 '18 at 16:41
30

You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.

from sklearn.model_selection import GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold

# Define svr here
...

# Specify cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
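
For completeness, here is a small self-contained sketch of this approach; the synthetic regression data and the C values below are placeholders, not taken from the question.

# A minimal, runnable sketch of repeated k-fold CV inside GridSearchCV.
# The data and the parameter grid are made up for illustration only.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

svr = SVR(kernel="rbf")
p_grid = {"C": [0.1, 1, 10, 100]}

# 10 repetitions of 5-fold CV, i.e. the 10 x 5CV asked for in the question
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv, scoring="r2")
clf.fit(X, y)

print(clf.best_params_)
print(clf.best_score_)  # mean r2 of the best C over all 50 (10 x 5) splits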
AdamRH
  • This doesn't work for me. I get the following error: `TypeError: 'RepeatedKFold' object is not iterable` – tmastny Feb 24 '18 at 00:26
  • 1
    @tmastny I can't reproduce this error. Is it related to [this post](https://stackoverflow.com/questions/43176916/typeerror-shufflesplit-object-is-not-iterable)? That is, is your `GridSearchCV` coming from `sklearn.model_selection` or from `sklearn.grid_search`? – AdamRH Feb 25 '18 at 12:44
  • Awesome, it works now. Thanks for your patience. This is definitely the most up to date answer, and makes repeated k-fold tuning very straightforward. – tmastny Feb 25 '18 at 14:35
  • This is amazing, plus one from me :) – Mike Aug 07 '18 at 14:33
  • 1
    This should be the correct solution, not the one with nested CV – SiXUlm Feb 22 '19 at 21:52
  • IMO this should be the accepted answer. Clean and concise. – learnToCode Oct 22 '21 at 07:12