I'm trying to compute an ROC curve with roc_curve, but I get this error message:

Traceback (most recent call last):
  File "script.py", line 94, in <module>
    fpr, tpr, _ = roc_curve(y_validate, status[:,1])
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 501, in roc_curve
    y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 308, in _binary_clf_curve
    raise ValueError("Data is not binary and pos_label is not specified")
ValueError: Data is not binary and pos_label is not specified

My code:

from sklearn.metrics import roc_curve, auc

status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1])  # error generated here
roc_auc = auc(fpr, tpr)
print(roc_auc)

P.S. I don't really understand this solution (ValueError: Data is not binary and pos_label is not specified), because it doesn't seem related to my case.

halfer
Nurdin

1 Answer

For an ROC curve to be computed, you have to specify which label is treated as the "true" or "positive" class. If you don't pass pos_label, scikit-learn expects the labels (in your case, the variable y_validate) to be either {0, 1} or {-1, 1} and treats 1 as the positive label; any other set of labels raises this error.

As your error message says, your data does not have this expected binary format. Even if your data has only two classes, labels such as 'T' and 'F' will trigger this error. According to the documentation for scikit-learn's roc_curve() function, you then need to pass the label to use as the positive class via pos_label. So if the labels in y_validate were 'T' and 'F', you would write: fpr, tpr, _ = roc_curve(y_validate, status[:,1], pos_label='T'). Make sure the score column you pass belongs to that class; rf.classes_ gives the column order of predict_proba.
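
A minimal sketch of the two usual fixes, using made-up data: the arrays y_true and y_score below are hypothetical stand-ins for y_validate and status[:,1], not values from the question. It shows passing pos_label explicitly and, as an alternative, mapping the labels to 0/1 first.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical string labels standing in for y_validate
y_true = np.array(['F', 'F', 'T', 'T'])
# Hypothetical positive-class scores standing in for status[:, 1]
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Fix 1: tell roc_curve which label is the positive class
fpr, tpr, _ = roc_curve(y_true, y_score, pos_label='T')
roc_auc = auc(fpr, tpr)
print(roc_auc)

# Fix 2: map the labels to 0/1 so that no pos_label is needed
y_binary = (y_true == 'T').astype(int)
fpr_b, tpr_b, _ = roc_curve(y_binary, y_score)
roc_auc_binary = auc(fpr_b, tpr_b)
print(roc_auc_binary)
```

Both variants should give the same AUC, since they describe the same positive class.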

mprat