What is the meaning of 'mean_test_score' in cv_result?

Question

Hello, I'm running a GridSearchCV and printing the result with the .cv_results_ attribute from scikit-learn.

My problem is that when I compute the mean over all the test score splits by hand, I get a different number from the one reported in 'mean_test_score'. How does it differ from the standard np.mean()?

I attach here the code with the result:

from sklearn.model_selection import GridSearchCV, GroupKFold

# model, X, Y, patients and score_auc are defined elsewhere (not shown)
n_estimators = [100]
max_depth = [3]
learning_rate = [0.1]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)

gkf = GroupKFold(n_splits=7)

grid_search = GridSearchCV(model, param_grid, scoring=score_auc, cv=gkf)
grid_result = grid_search.fit(X, Y, groups=patients)

grid_result.cv_results_

The result of this operation is:

{'mean_fit_time': array([ 8.92773601]),
 'mean_score_time': array([ 0.04288721]),
 'mean_test_score': array([ 0.83490629]),
 'mean_train_score': array([ 0.95167036]),
 'param_learning_rate': masked_array(data = [0.1],
              mask = [False],
        fill_value = ?),
 'param_max_depth': masked_array(data = [3],
              mask = [False],
        fill_value = ?),
 'param_n_estimators': masked_array(data = [100],
              mask = [False],
        fill_value = ?),
 'params': ({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100},),
 'rank_test_score': array([1]),
 'split0_test_score': array([ 0.74821666]),
 'split0_train_score': array([ 0.97564995]),
 'split1_test_score': array([ 0.80089016]),
 'split1_train_score': array([ 0.95361201]),
 'split2_test_score': array([ 0.92876979]),
 'split2_train_score': array([ 0.93935856]),
 'split3_test_score': array([ 0.95540287]),
 'split3_train_score': array([ 0.94718634]),
 'split4_test_score': array([ 0.89083901]),
 'split4_train_score': array([ 0.94787374]),
 'split5_test_score': array([ 0.90926355]),
 'split5_train_score': array([ 0.94829775]),
 'split6_test_score': array([ 0.82520379]),
 'split6_train_score': array([ 0.94971417]),
 'std_fit_time': array([ 1.79167576]),
 'std_score_time': array([ 0.02970254]),
 'std_test_score': array([ 0.0809713]),
 'std_train_score': array([ 0.0105566])}

As you can see, taking np.mean of all the test scores gives approximately 0.8655122606479532, while 'mean_test_score' is 0.83490629.

Thanks for your help, Leonardo.

Isn't it the array with all the 'split0_test_score', 'split1_test_score', and so on? – Dipe, Jul 06 '17 at 11:47
I am guessing `scoring=score_auc` is a custom scoring function you provide, since it is not one of the allowed values. Is it weighted? – mkaran, Jul 06 '17 at 11:54
It's a simple score_auc evaluation: `def score_auc(estimator, X, y): probas = estimator.predict_proba(X); fpr, tpr, thresholds = roc_curve(y, probas[:, 0], pos_label=1); return auc(fpr, tpr)` – Dipe, Jul 06 '17 at 12:02
Can you print the size of the folds? Run `print([(len(train), len(test)) for train, test in gkf.split(X, groups=patients)])` – Johannes, Jul 06 '17 at 12:13
Yes of course! [(41835, 24377), (56229, 9983), (56581, 9631), (58759, 7453), (60893, 5319), (60919, 5293), (62056, 4156)] – Dipe, Jul 06 '17 at 12:17

score 7 · Accepted Answer · answered Jul 06 '17 at 13:21

I will post this as a new answer since it's so much code:

The test and train scores of the folds are: (taken from the results you posted in your question)

test_scores = [0.74821666,0.80089016,0.92876979,0.95540287,0.89083901,0.90926355,0.82520379]
train_scores = [0.97564995,0.95361201,0.93935856,0.94718634,0.94787374,0.94829775,0.94971417]

The numbers of training and test samples in those folds are: (taken from the output of `print([(len(train), len(test)) for train, test in gkf.split(X, groups=patients)])`)

train_len = [41835, 56229, 56581, 58759, 60893, 60919, 62056]
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]

Then the train and test means, each weighted by the number of samples in the corresponding fold, are:

train_avg = np.average(train_scores, weights=train_len)
-> 0.95064898361714389
test_avg = np.average(test_scores, weights=test_len)
-> 0.83490628649308296

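So the weighted test average is exactly the 'mean_test_score' that sklearn gives you. (The 'mean_train_score' of 0.95167036 in your output, by contrast, matches the plain mean of the train scores rather than the weighted train_avg above.) Because each fold is weighted by its number of test samples, the weighted value is the appropriate mean score for your classification. The plain mean of the folds is misleading here in that it depends on the somewhat arbitrary splits/folds you chose.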
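
To make the comparison easy to reproduce without NumPy, here is a minimal sketch in plain Python using the numbers posted in the question and comments. The helper name `weighted_mean` is made up for illustration; it is meant to mirror what `np.average(values, weights=...)` computes. It only reflects the weighting discussed in this thread, so other scikit-learn versions or settings may behave differently.

```python
# Per-fold scores and test-fold sizes, copied from the question and comments.
test_scores = [0.74821666, 0.80089016, 0.92876979, 0.95540287,
               0.89083901, 0.90926355, 0.82520379]
train_scores = [0.97564995, 0.95361201, 0.93935856, 0.94718634,
                0.94787374, 0.94829775, 0.94971417]
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]


def weighted_mean(values, weights):
    """Hypothetical helper: sum(w * v) / sum(w), like np.average with weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Plain mean: every fold counts the same, regardless of its size.
plain_test_mean = sum(test_scores) / len(test_scores)
# Weighted mean: every fold counts in proportion to its number of test samples.
weighted_test_mean = weighted_mean(test_scores, test_len)
# Plain mean of the train scores, for comparison with 'mean_train_score'.
plain_train_mean = sum(train_scores) / len(train_scores)

print(f"plain mean of test folds:    {plain_test_mean:.8f}")
print(f"weighted mean of test folds: {weighted_test_mean:.8f}")
print(f"plain mean of train folds:   {plain_train_mean:.8f}")
```

The weighted test mean can be compared with the reported 'mean_test_score' (0.83490629), the plain test mean with the value computed by hand in the question (about 0.8655), and the plain train mean with the reported 'mean_train_score' (0.95167036).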

So in conclusion, both explanations were indeed identical and correct.

Bharath M Shetty · Answer 2 · 2017-07-06T12:08:16.337

3

If you look at the source code of GridSearchCV in the scikit-learn GitHub repository, you will see that they don't use np.mean(); instead they use np.average() with weights. Hence the difference. Here's their code:

n_splits = 3
# np.int (as quoted) was an alias of the builtin int and is gone from newer NumPy
test_sample_counts = np.array(test_sample_counts[:n_splits],
                              dtype=np.int)
# weight each split by its number of test samples when iid is set
weights = test_sample_counts if self.iid else None
means = np.average(test_scores, axis=1, weights=weights)
stds = np.sqrt(np.average((test_scores - means[:, np.newaxis]) ** 2,
                          axis=1, weights=weights))

cv_results = dict()
for split_i in range(n_splits):
    cv_results["split%d_test_score" % split_i] = test_scores[:, split_i]
cv_results["mean_test_score"] = means
cv_results["std_test_score"] = stds
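
The weighted standard deviation in that excerpt can be spelled out in plain Python as well. This is only a sketch of the arithmetic, not scikit-learn's code: the function name `weighted_mean_std` is invented here, and the inputs are the fold scores and test-fold sizes posted in the question and its comments.

```python
import math


def weighted_mean_std(scores, weights):
    """Hypothetical helper: weighted mean and weighted (population) std.

    Mirrors np.average(scores, weights=weights) followed by
    np.sqrt(np.average((scores - mean) ** 2, weights=weights)).
    """
    total = sum(weights)
    mean = sum(w * s for s, w in zip(scores, weights)) / total
    variance = sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / total
    return mean, math.sqrt(variance)


# Values copied from the question (split*_test_score) and the fold sizes comment.
test_scores = [0.74821666, 0.80089016, 0.92876979, 0.95540287,
               0.89083901, 0.90926355, 0.82520379]
test_sample_counts = [24377, 9983, 9631, 7453, 5319, 5293, 4156]

mean, std = weighted_mean_std(test_scores, test_sample_counts)
print(f"weighted mean: {mean:.8f}")
print(f"weighted std:  {std:.8f}")
```

The two printed values can be checked against 'mean_test_score' (0.83490629) and 'std_test_score' (0.0809713) in the posted cv_results_.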

In case you want to know more about the difference between them, take a look at "Difference between np.mean() and np.average()".

edited Jul 06 '17 at 12:08

answered Jul 06 '17 at 11:55


Indeed, but `np.average` and `np.mean` in the provided example give the same results. – mkaran Jul 06 '17 at 11:58
They use weights parameter. – Bharath M Shetty Jul 06 '17 at 11:59
Yeap that would be the difference here. – mkaran Jul 06 '17 at 11:59
As far as I understand, the weight parameter is the sample count in the particular fold/split. So I guess this comes down to the same explanation as my answer? – Johannes Jul 06 '17 at 12:01
@Johannes I believe that's the case! +1 to your answer too. Both answers actually contribute to this question for different reasons. – mkaran Jul 06 '17 at 12:12

Johannes · Answer 3 · 2017-07-06T11:53:13.197

I suppose the reason for the different means is that different weighting factors are used in the mean calculation.

The mean_test_score that sklearn returns is the mean calculated on all samples where each sample has the same weight.

If you calculate the mean by taking the mean of the folds (splits), then you only get the same result if the folds are all of equal size. If they are not, each sample in a larger fold automatically has a smaller impact on the mean of the folds than each sample in a smaller fold, and vice versa.

Small numeric example:

mean([2,3,5,8,9]) = 5.4 # mean over all samples ('mean_test_score')

mean([2,3,5]) = 3.333 # mean of fold 1
mean([8,9]) = 8.5 # mean of fold 2

mean([3.333, 8.5]) = 5.92 # mean of means of folds

5.4 != 5.92
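
The same toy example as a small runnable sketch. The names (`fold_1`, `fold_2`, `mean`) are just for illustration; the last step shows that weighting each fold mean by its fold size recovers the mean over all samples.

```python
fold_1 = [2, 3, 5]
fold_2 = [8, 9]
folds = [fold_1, fold_2]


def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Mean over all samples pooled together: every sample has the same weight.
pooled = mean(fold_1 + fold_2)

# Mean of the per-fold means: every fold has the same weight,
# so samples in the smaller fold count for more.
fold_means = [mean(fold) for fold in folds]
mean_of_means = mean(fold_means)

# Weighting each fold mean by its fold size gives the pooled mean again.
sizes = [len(fold) for fold in folds]
weighted = sum(n * m for n, m in zip(sizes, fold_means)) / sum(sizes)

print(pooled, mean_of_means, weighted)
```

The unweighted mean of the fold means comes out larger than the pooled mean here because the small fold with the high values gets the same weight as the larger fold.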
