
I'm working on a multi-class classification task and want to evaluate the result using a ROC curve in sklearn. As far as I know, roc_curve can handle this case if I set a positive label. I tried to plot a ROC curve using pos_label and got strange results: the larger the "positive label" of the class, the closer the ROC curve moved to the top left corner. I then plotted a ROC curve after first binarizing the label arrays. The two plots were different! I think the second one was built correctly, but with binarized labels the plot has only 3 points, which is not informative.

I want to understand why the ROC curve for the binarized labels and the ROC curve with pos_label look different, and how to plot a ROC curve with pos_label correctly.

Here is the code:

from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

y_pred = [1,2,2,2,3,3,1,1,1,1,1,2,1,2,3,2,2,1,1]
y_test = [1,3,2,2,1,3,2,1,2,2,1,2,2,2,1,1,1,1,1]

# ROC curve on the raw labels, treating class 2 as the positive class
fp, tp, _ = roc_curve(y_test, y_pred, pos_label=2)

# ROC curve for class 2 after binarizing the labels (one-vs-rest)
y_pred = label_binarize(y_pred, classes=[1, 2, 3])
y_test = label_binarize(y_test, classes=[1, 2, 3])
fpb, tpb, _b = roc_curve(y_test[:, 1], y_pred[:, 1])

plt.plot(fp, tp, 'ro-', fpb, tpb, 'bo-', alpha=0.5)
plt.show()
print('AUC with pos_label', auc(fp, tp))
print('AUC binary variant', auc(fpb, tpb))

Here is an example of the resulting plot.

The red curve represents roc_curve with pos_label; the blue curve represents roc_curve in the "binary" case.

  • Seems you are in a multi-class setting (more than 2 classes), and not a multi-label one (a single instance can belong to more than one class) - edited question and tags. – desertnaut Jul 11 '19 at 15:18
  • @desertnaut you're right, I have 3 different classes. As far as I know, setting pos_label allows building a "one vs all" curve when there are many classes, the same as a ROC curve for 2 binary classes. – svetlana Jul 11 '19 at 15:26
  • 1
    Generally speaking, keep in mind that ROC curves need the *probabilistic* predictions in `y_pred`, and not the "hard" classes. – desertnaut Jul 11 '19 at 15:35
  • @desertnaut probabilities would undoubtedly be preferable for a ROC curve, but the sklearn documentation says we can use a non-thresholded measure: "Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers)" – svetlana Jul 11 '19 at 15:50
  • Yes - the emphasis is on **non-thresholded**; but your `y_pred` are indeed *thresholded*, thus providing the "hard" class, in contrast with the measures you list. – desertnaut Jul 11 '19 at 15:55
  • @desertnaut Thank you, I understand now. Could you please give me some advice on which metrics I can use to evaluate the quality of such a multi-class classification with "hard" classes? It would be enough to plot some statistics for every class separately. – svetlana Jul 11 '19 at 16:04

1 Answer


As explained in the comments, ROC curves are not suitable for evaluating thresholded predictions (i.e. hard classes) such as your y_pred. Moreover, when using AUC, it is useful to keep in mind some limitations that are not readily apparent to many practitioners - see the last part of my own answer in Getting a low ROC AUC score but a high accuracy for more details.
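
If you do want ROC (or similar) curves in this setting, the scores passed to roc_curve should be probabilities or another non-thresholded measure, not the predicted classes. Below is a minimal one-vs-rest sketch, assuming a classifier that exposes predict_proba; the LogisticRegression model and the feature matrix X are placeholders for illustration only:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
import numpy as np

X = np.random.rand(19, 4)  # hypothetical features, for illustration only
y = np.array([1,3,2,2,1,3,2,1,2,2,1,2,2,2,1,1,1,1,1])  # labels from the question

clf = LogisticRegression(max_iter=1000).fit(X, y)
y_score = clf.predict_proba(X)  # shape (19, 3): one probability column per class

# one-vs-rest ROC curve per class, scored by that class's probability column
y_bin = label_binarize(y, classes=[1, 2, 3])
for i, cls in enumerate([1, 2, 3]):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    print('class', cls, 'AUC:', auc(fpr, tpr))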

Could you please give me some advice on which metrics I can use to evaluate the quality of such a multi-class classification with "hard" classes?

The most straightforward approach would be the confusion matrix and the classification report readily provided by scikit-learn:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = [1,2,2,2,3,3,1,1,1,1,1,2,1,2,3,2,2,1,1]
y_test = [1,3,2,2,1,3,2,1,2,2,1,2,2,2,1,1,1,1,1]

print(classification_report(y_test, y_pred)) # caution - order of arguments matters!
# result:
             precision    recall  f1-score   support

          1       0.56      0.56      0.56         9
          2       0.57      0.50      0.53         8
          3       0.33      0.50      0.40         2

avg / total       0.54      0.53      0.53        19

cm = confusion_matrix(y_test, y_pred) # again, order of arguments matters
cm
# result:
array([[5, 2, 2],
       [4, 4, 0],
       [0, 1, 1]], dtype=int64)

From the confusion matrix you can extract other quantities of interest, like the true & false positives per class; for details, please see my own answer in How to get precision, recall and f-measure from confusion matrix in Python.
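
For example, here is a minimal sketch of getting these one-vs-rest counts per class directly from the confusion matrix above, using plain NumPy:

import numpy as np

cm = np.array([[5, 2, 2],
               [4, 4, 0],
               [0, 1, 1]])

TP = cm.diagonal()              # correctly predicted samples of each class
FP = cm.sum(axis=0) - TP        # predicted as the class but actually another one
FN = cm.sum(axis=1) - TP        # belonging to the class but predicted as another one
TN = cm.sum() - (TP + FP + FN)  # all remaining samples

print('TP:', TP, 'FP:', FP, 'FN:', FN, 'TN:', TN)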

  • Thank you! As for plots: can precision_recall_curve be representative, or will I face the same problems as with the ROC curve? – svetlana Jul 11 '19 at 16:38
  • @svetlana you are very welcome; precision-recall curve also needs non-thresholded predictions – desertnaut Jul 11 '19 at 16:43
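
As a follow-up to the last comment, here is a minimal sketch of a one-vs-rest precision-recall curve built from probabilistic scores; as above, the LogisticRegression model and the feature matrix X are placeholders for illustration only:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize
import numpy as np

X = np.random.rand(19, 4)  # hypothetical features, for illustration only
y = np.array([1,3,2,2,1,3,2,1,2,2,1,2,2,2,1,1,1,1,1])

y_score = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)
y_bin = label_binarize(y, classes=[1, 2, 3])

# precision-recall curve for class 2 (column 1), scored by its probability column
precision, recall, _ = precision_recall_curve(y_bin[:, 1], y_score[:, 1])
print('Average precision, class 2:', average_precision_score(y_bin[:, 1], y_score[:, 1]))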