
I'm using Python and I would like to use nested cross-validation with scikit-learn. I have found a very good example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Assumed setup, as in scikit-learn's nested cross-validation example:
# an SVC and a small parameter grid on the iris data.
X_iris, y_iris = load_iris(return_X_y=True)
svr = SVC(kernel="rbf")
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

NUM_TRIALS = 30
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

for i in range(NUM_TRIALS):
    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g. "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

How can the best set of parameters, as well as all sets of parameters (with their corresponding scores), be accessed from the nested cross-validation?

machinery

2 Answers


You cannot access the individual parameters or the best parameters from cross_val_score. Internally, cross_val_score clones the supplied estimator and then, for each split, calls fit and score on that clone with the given X and y; it only returns the array of test scores, so the fitted estimators (and their parameters) are discarded.
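
For illustration, a minimal sketch reusing clf, outer_cv and the iris data from the question: the return value is just a NumPy array of per-fold scores, with no handle on the fitted GridSearchCV clones.

# cross_val_score returns only the per-fold test scores
scores = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
print(scores)         # one score per outer fold (4 here)
print(scores.mean())  # the nested CV estimate for this trial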

If you want to access the parameters at each split, you can use:

# Put the code below inside your NUM_TRIALS for loop
# (nested_scores_train and nested_scores_test are arrays of length NUM_TRIALS,
#  created outside the loop just like non_nested_scores above)
cv_iter = 0
temp_nested_scores_train = np.zeros(4)
temp_nested_scores_test = np.zeros(4)
for train, test in outer_cv.split(X_iris):
    clf.fit(X_iris[train], y_iris[train])
    temp_nested_scores_train[cv_iter] = clf.best_score_
    temp_nested_scores_test[cv_iter] = clf.score(X_iris[test], y_iris[test])
    # You can access the grid search's parameters here,
    # e.g. clf.best_params_ and clf.cv_results_
    cv_iter += 1
nested_scores_train[i] = temp_nested_scores_train.mean()
nested_scores_test[i] = temp_nested_scores_test.mean()
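
For example, at that commented line the fold's GridSearchCV exposes both the best parameter set and every candidate with its mean inner-CV score (a minimal sketch using clf from above):

print(clf.best_params_)                          # best parameter set for this outer fold
all_params = clf.cv_results_["params"]           # every candidate parameter set
all_scores = clf.cv_results_["mean_test_score"]  # mean inner-CV score of each candidate
for params, score in zip(all_params, all_scores):
    print(params, score)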
Vivek Kumar
  • what model (hyperparameters) do I use for prediction on new data? May be related to OP's request for "best parameters". – Paul May 15 '18 at 04:36
  • @Paul Please explain in more detail, preferably as a new question. You can use the GridSearchCV for the hyper-parameter tuning. – Vivek Kumar May 15 '18 at 05:30
  • Each Grid Search, clf.fit(), in the 4 inner loops will return a different set of hyperparameters (4 sets of hyperparameters). When I wish to predict on new, unseen data, I need some model, M, fit with some hyperparameters. Which one of the 4 sets (or some other set) of hyperparameters do I use for model M? I think this question is posed here: https://stats.stackexchange.com/q/319780 I think the answer is that you just run another, separate Grid Search on full data, as explained in second paragraph of https://stats.stackexchange.com/a/65158 Is that right? – Paul May 17 '18 at 04:45

Vivek Kumar's answer is based on an explicit for loop over the outer CV splits. If the OP wants to access the best estimator and best parameters through sklearn's own cross-validation workflow, I'd suggest using cross_validate instead of cross_val_score, because the former can return the fitted estimators. An added bonus of cross_validate is that you can specify multiple metrics.

from sklearn.model_selection import cross_validate
# [1] "roc_auc" assumes a binary target; with the multiclass iris data use e.g. "accuracy"
scoring = {"auroc": "roc_auc"}
nested_scores = cross_validate(clf, X=X_iris, y=y_iris, cv=outer_cv,
                               scoring=scoring, return_estimator=True)

Then you can access the best model from each cv fold:

best_models = nested_scores['estimator']   # one fitted GridSearchCV per outer fold
for i, model in enumerate(best_models):
    best_model = model.best_estimator_     # estimator refit with this fold's best parameters
    best_params = model.best_params_       # this fold's best parameter set
    print(i, best_params)
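
As a further sketch (the key names depend on the scoring dict above), the dictionary returned by cross_validate also holds the outer-fold test scores, and each fitted GridSearchCV exposes every candidate parameter set with its mean inner-CV score via cv_results_:

print(nested_scores['test_auroc'])   # outer-fold test scores, keyed by the scorer name
for model in nested_scores['estimator']:
    # all parameter sets tried in this fold, with their mean inner-CV scores
    for params, score in zip(model.cv_results_['params'],
                             model.cv_results_['mean_test_score']):
        print(params, score)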

[1] For a list of available scorers, see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

oustella