Why does roc_auc produce weird results in sklearn?

Question

I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average f-measure and roc_auc.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# input_path, input_file and features are defined elsewhere
df = pd.read_csv(input_path + input_file)

X = df[features]
y = df["gold_standard"]  # 1-D target (a Series) instead of a one-column DataFrame

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc')
scores = cross_validate(clf, X, y, cv=k_fold, scoring=scoring)

# Print the mean of each metric over the 10 folds
for metric in scoring:
    print(metric)
    print(np.mean(scores['test_' + metric]))

I got the following results for the same dataset with 2 different feature settings.

Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120

Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233

So, we can see that feature setting 1 gives better results than feature setting 2 for 'accuracy', 'precision_weighted', 'recall_weighted' and 'f1_weighted'.

However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.

I suspect that this happens because I am using weighted scores for precision, recall and f-measure but not for roc_auc. Is it possible to do weighted roc_auc for binary classification in sklearn?

What is the real reason for these weird roc_auc results?

score 3 · Accepted Answer · answered Mar 30 '20 at 16:01

It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.

Here is a high-level description of the whole process:

Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and f1-score are calculated over the hard class predictions 0/1, i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.

So, it can certainly happen that one model scores better on the thresholded metrics while another scores better on AUC, and this can indeed lead to confusion among new practitioners.
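A minimal sketch of this effect, using made-up scores for two hypothetical models (`scores_a`, `scores_b`) rather than the asker's data. The helpers `accuracy_at` and `roc_auc` are written by hand for illustration; scikit-learn's `accuracy_score` and `roc_auc_score` should give the same numbers. Model A wins on accuracy at the default 0.5 threshold, while model B ranks every positive above every negative and therefore wins on AUC:

```python
def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy of the hard 0/1 predictions obtained at one fixed threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. No threshold is involved.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: 4 negatives followed by 4 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]

# Model A: well placed around 0.5, but one negative outranks every positive.
scores_a = [0.10, 0.20, 0.30, 0.90, 0.60, 0.70, 0.80, 0.55]

# Model B: perfect ranking, but most positives sit below 0.5.
scores_b = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60]

print(accuracy_at(y_true, scores_a), roc_auc(y_true, scores_a))  # 0.875 0.75
print(accuracy_at(y_true, scores_b), roc_auc(y_true, scores_b))  # 0.625 1.0

# With a threshold chosen for model B, its hard predictions become perfect.
print(accuracy_at(y_true, scores_b, threshold=0.25))  # 1.0
```

In this toy case neither number is wrong: accuracy describes the model at one operating point, AUC describes how well the scores rank positives above negatives across all of them.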

The second part of my answer in this similar question might be helpful for more details. Quoting:

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.

Thanks a lot for the clarification. That makes sense. I want to identify the most suitable feature setting for my problem. In that case, would you recommend that I go with `roc_auc` scores and select `feature setting 2`, or vice versa? Please let me know your thoughts. Thank you :) — EmJ, Mar 30 '20 at 16:13
@EmJ I think the very last part of the linked thread implies an answer ;) — desertnaut, Mar 30 '20 at 16:14
hehe, ya I figured it out. Thank you for the wonderful answer that you linked. I learnt a lot from it. I have one last question that I thought would be great to get your feedback on. I noted that my `accuracy` is always the same as my `weighted recall` value. This makes me really worried. Do you know why this happens? Thank you :) — EmJ, Mar 30 '20 at 16:21
@EmJ sorry, not quite sure - check the definition in the docs — desertnaut, Mar 30 '20 at 16:22
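
On the accuracy vs. weighted recall question raised in the comments: this is expected rather than worrying. Weighted recall averages the per-class recalls TP_c / n_c with weights n_c / N (the class supports), so the n_c terms cancel and what remains is the total number of correct predictions divided by N, which is accuracy. Below is a small sketch with made-up labels and hand-written helpers (`accuracy` and `weighted_recall` are illustrative names, not sklearn functions), assuming the support-weighted definition that the sklearn docs give for `average='weighted'`:

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    n = len(y_true)
    support = Counter(y_true)  # n_c: number of true instances of each class
    total = 0.0
    for c, n_c in support.items():
        tp_c = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total += (n_c / n) * (tp_c / n_c)  # weight * recall = tp_c / n
    return total


# Illustrative labels only: 6 negatives, 4 positives, 7 correct predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
w_rec = weighted_recall(y_true, y_pred)
print(acc, w_rec)
print(math.isclose(acc, w_rec))  # the two agree up to floating-point rounding
```

The same cancellation does not apply to weighted precision or weighted f1, which is why those columns differ from accuracy in the results above.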
