
I'm having a hard time understanding the return_train_score parameter of GridSearchCV. From the docs:

return_train_score : boolean, optional

       If False, the cv_results_ attribute will not include training scores.

My question is: what are the training scores?

In the following code I split the data into ten stratified folds. As a consequence, grid.cv_results_ contains ten test scores, namely 'split0_test_score', 'split1_test_score', ..., 'split9_test_score'. I'm aware that each of those is the accuracy obtained by a 5-nearest-neighbors classifier that uses the corresponding fold for testing and the remaining nine folds for training.

grid.cv_results_ also contains ten train scores: 'split0_train_score', 'split1_train_score', ..., 'split9_train_score'. How are these values calculated?

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold    

X, y = datasets.load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=10)  # random_state only matters when shuffle=True
knn = KNeighborsClassifier()

grid = GridSearchCV(estimator=knn, 
                    cv=skf, 
                    param_grid={'n_neighbors': [5]}, 
                    return_train_score=True)
grid.fit(X, y)

print('Mean test score: {}'.format(grid.cv_results_['mean_test_score']))
print('Mean train score: {}'.format(grid.cv_results_['mean_train_score']))
#Mean test score: [ 0.96666667]
#Mean train score: [ 0.96888889]
Tonechas

2 Answers


It is the score of the fitted model on all the folds except the one held out for testing. In your case, it is the score computed on the nine folds the model was trained on.
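To make this concrete, here is a minimal sketch that recomputes the per-split train scores by hand and compares them with what GridSearchCV reports. It assumes the same iris data and ten unshuffled stratified folds as in the question; the names manual_train_scores and reported are only illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10)  # unshuffled, so the splits are repeatable
knn = KNeighborsClassifier(n_neighbors=5)

grid = GridSearchCV(estimator=knn,
                    cv=skf,
                    param_grid={'n_neighbors': [5]},
                    return_train_score=True)
grid.fit(X, y)

# Recompute the train score of each split by hand:
# fit on the nine training folds, then score on those same nine folds.
manual_train_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = clone(knn).fit(X[train_idx], y[train_idx])
    manual_train_scores.append(model.score(X[train_idx], y[train_idx]))

# Per-split train scores reported by GridSearchCV (one candidate, so index 0).
reported = [grid.cv_results_['split{}_train_score'.format(i)][0]
            for i in range(skf.get_n_splits())]

print('Scores match: {}'.format(np.allclose(manual_train_scores, reported)))
```

With no scoring argument, the score used here is the classifier's default score() method, which for KNeighborsClassifier is accuracy.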

Jan K
  • Thank you Jan K and @Vivek Kumar for your helpful answers. Is _train score_ defined anywhere in the documentation? If yes, could you provide me with a link? – Tonechas Apr 18 '18 at 13:13
  • @Tonechas The `return_train_score` param on [GridSearchCV documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) has some info about it – Vivek Kumar Apr 18 '18 at 14:43

Maybe my other answer here will give you a clearer understanding of how grid search works.

Essentially, training scores are the model's scores on the same data it was trained on.

In each fold split, the data is divided into two parts: train and test. The train data is used to fit() the internal estimator, and the test data is used to check its performance. The training score just checks how well the model fits the training data.
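The same train/test distinction can be seen outside a grid search. The sketch below is only an illustration, assuming the iris data and ten unshuffled stratified folds from the question; it uses cross_validate, which also accepts return_train_score, to print both scores for every split. The variable name cv_out is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# For each split: fit on the train part, then score on both parts.
cv_out = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y,
                        cv=StratifiedKFold(n_splits=10),
                        return_train_score=True)

# 'train_score' is measured on the data used for fitting,
# 'test_score' on the held-out fold.
for i, (train, test) in enumerate(zip(cv_out['train_score'],
                                      cv_out['test_score'])):
    print('split {}: train={:.4f}  test={:.4f}'.format(i, train, test))
```

A train score that is much higher than the test score suggests overfitting, which is the usual reason for looking at both.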

Vivek Kumar