
When I train an SVC with cross-validation,

y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

cross_val_predict returns one class prediction for each element in X, so that y_pred.shape = (1000,) when m=1000. This makes sense, since cv=5 and therefore the SVC was trained and validated 5 times on different parts of X. In each of the five validations, predictions were made for one fifth of the instances (m/5 = 200). Subsequently, the 5 vectors, containing 200 predictions each, were merged into y_pred.

With all of this in mind it would be reasonable for me to calculate the overall accuracy of the SVC using y_pred and y.

score = accuracy_score(y, y_pred)
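Put together, a minimal runnable sketch of the above (the synthetic dataset from make_classification and the default SVC settings are assumptions, only there to make the snippet self-contained):

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for X, y with m = 1000 (an assumption, not my real data)
X, y = make_classification(n_samples=1000, random_state=0)
svc = SVC()

# One out-of-fold prediction per instance: y_pred.shape == (1000,)
y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')
print(y_pred.shape)

# Pooled accuracy over all out-of-fold predictions
score = accuracy_score(y, y_pred)
print(score)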

But (!) the documentation of cross_val_predict states:

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalisation error.

Could someone please explain in other words, why cross_val_predict is not appropriate for measuring the generalisation error e.g. via accuracy_score(y, y_pred)?


Edit:

I first assumed that, with cv=5, predictions would be made for all instances of X in each of the 5 validations. But this is wrong: predictions are only made for 1/5 of the instances of X per validation.
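For illustration, a small sketch of the fold sizes behind cv=5 (a plain KFold is used here for simplicity; cross_val_predict actually defaults to StratifiedKFold for classifiers):

import numpy as np
from sklearn.model_selection import KFold

X_dummy = np.zeros((1000, 1))  # stand-in for X with m = 1000
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X_dummy)):
    # Each fold trains on 4/5 of the data and predicts the remaining 1/5
    print(fold, len(train_idx), len(test_idx))  # -> 800 train, 200 test per fold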

zwithouta

1 Answer


cross_val_score vs cross_val_predict

The differences between cross_val_predict and cross_val_score are described really clearly here, and there is another link in there, so you can follow the rabbit hole.

In essence:

  • cross_val_score returns a score for each fold
  • cross_val_predict makes out-of-fold predictions for each data point.

Now, you have no way of knowing which predictions in cross_val_predict came from which fold, so you cannot calculate a per-fold average the way cross_val_score does. You could compare the mean of cross_val_score with the accuracy_score of cross_val_predict, but an average of averages is not, in general, equal to the overall average, so the results would differ.

If one fold has very low accuracy, it drags down the averaged cross_val_score more than it drags down the pooled accuracy computed from cross_val_predict (especially when that fold is small).

Furthermore, you could group the same data points into folds differently and get different results (see the seven-point example below). That's why the documentation points out that the grouping makes the difference.
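Here is a rough sketch of that comparison at the API level (the synthetic dataset and the default SVC are assumptions). With equal-sized folds the two numbers usually coincide; with unequal fold sizes, the mean of the per-fold scores can drift away from the pooled score:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

# 1003 samples, deliberately not divisible by 5, so the folds differ in size
X, y = make_classification(n_samples=1003, random_state=0)
svc = SVC()

fold_scores = cross_val_score(svc, X, y, cv=5, scoring='accuracy')
pooled = accuracy_score(y, cross_val_predict(svc, X, y, cv=5))

print(fold_scores.mean())  # average of 5 per-fold accuracies
print(pooled)              # one accuracy over all out-of-fold predictions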

Example of difference between cross_val_score and cross_val_predict

Let's imagine cross_val_predict uses 3 folds for 7 data points and the out-of-fold predictions are [0,1,1,0,1,0,1], while the true targets are [0,1,1,0,1,1,0]. The accuracy score would then be 5/7 (only the last two points are predicted wrongly).

Now take those same predictions and split them into the following 3 folds:

  • [0, 1, 1] predictions and [0, 1, 1] targets -> accuracy of 1 for the first fold
  • [0, 1] predictions and [0, 1] targets -> perfect accuracy again
  • [0, 1] predictions and [1, 0] targets -> accuracy of 0

This is what cross_val_score does: it would return an array of per-fold accuracies, namely [1, 1, 0]. Averaging this array gives an overall accuracy of 2/3.

See? With the same data, you get two different measures of accuracy (5/7 in one case, 2/3 in the other).
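The same toy example in code (the fold boundaries are simply taken from the lists above, not from a real splitter):

from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1]

# What accuracy_score(y, cross_val_predict(...)) computes: pooled accuracy
print(accuracy_score(y_true, y_pred))  # 5/7 ≈ 0.714

# What cross_val_score computes: one accuracy per fold, averaged afterwards
folds = [(0, 3), (3, 5), (5, 7)]
fold_acc = [accuracy_score(y_true[a:b], y_pred[a:b]) for a, b in folds]
print(fold_acc)                       # [1.0, 1.0, 0.0]
print(sum(fold_acc) / len(fold_acc))  # 2/3 ≈ 0.667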

In both cases, the grouping changed the total accuracy you obtain. Classifier errors weigh more heavily with cross_val_score, as each error influences its fold's accuracy more than it influences the pooled accuracy over all predictions (you can check this yourself).

Both could still be used for evaluating your model's performance on the validation data, though, and I see no contraindication, just different behaviour (fold errors not being weighted the same way).

Why neither is a measure of generalization

If you tune your algorithm based on the cross-validation results, you are leaking information (fine-tuning it to the training and validation data). To get a sense of the generalization error, you have to leave a part of your data out of both cross-validation and training.

You may want to perform nested (double) cross-validation, or simply hold a test set out, to see how well your model actually generalizes.
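A rough sketch of both options (the parameter grid and the test-set size are placeholders, not part of the answer):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Option 1: hold a test set out of all tuning and cross-validation
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(SVC(), param_grid={'C': [1, 100]}, cv=5).fit(X_cv, y_cv)
print(search.score(X_test, y_test))  # generalization estimate on unseen data

# Option 2: nested ("double") cross-validation -- the outer folds never see
# the data used to pick hyperparameters in the inner search
inner = GridSearchCV(SVC(), param_grid={'C': [1, 100]}, cv=5)
print(cross_val_score(inner, X, y, cv=5).mean())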

Szymon Maszke
  • Of course, in addition to the validation set, you also need a test set in order to assess how well the model generalizes. Let me rephrase my question. If I have an SVC A with C=1 and an SVC B with C=100, I would like to know: Can I use the predictions `y_pred` (computed via `cross_val_predict`) to calculate performance statistics such as accuracy for my SVCs A and B to evaluate whether C=1 or C=100 is the better hyperparameter for my data? – zwithouta Mar 05 '19 at 21:03
  • It depends whether the procedure described in my answer is fine for you. You would get different results from `cross_val_score` averaged and `cross_val_predict` averaged, as some errors would be weighted differently. All in all yes, you could do this and you will be able to evaluate which hyperparameter set is probably better (I see no contraindications here, someone correct me if I'm wrong). – Szymon Maszke Mar 05 '19 at 21:34
  • What exactly do you mean by "`cross_val_predict` averaged"? To average the output of `cross_val_predict` would mean averaging a vector containing class predictions, which makes no sense to me. Averaging the accuracy-score computed based on this prediction vector makes no sense to me either, since it is a single value. – zwithouta Mar 05 '19 at 21:52
  • Averaging `accuracy_score` of `cross_val_predict` and target, that's what I meant. Each prediction is made on the fold on which your classifier is not trained, hence you get so-called __out of fold predictions__ with shape equal to target (`y`) (and that is the behavior described in your __EDIT:__). Yes, you would get one value, that's what `accuracy_score` does and that's what you were after initially: `With all of this in mind it would be reasonable for me to calculate the overall accuracy of the SVC using y_pred and y.`, weren't you? – Szymon Maszke Mar 05 '19 at 22:13
  • Are the accuracy score computed from the `cross_val_predict`-output via `accuracy_score`, and the accuracy score computed from averaging the `cross_val_score`-output, not identical? If not, I still don't understand why. – zwithouta Mar 05 '19 at 22:31
  • You should look at the link I have provided about average of averages not being equal to an average. I have added an example in my answer though, is the difference clear now? – Szymon Maszke Mar 06 '19 at 07:51
  • 1
    Thank you very much.The example really helped! I did the same calculation, but the individual cv groups always had exactly the same size. In this case, the average of the averages is equal to the average (at least thats what Ricardo stated in the link you send me). – zwithouta Mar 06 '19 at 11:58