18

Given a balanced dataset (both classes are the same size), fitting an SVM model to it yields a high AUC value (~0.9) but a low accuracy (~0.5).

I have no idea why this would happen. Can anyone explain this case for me?

Jamin
  • 329
  • 1
  • 4
  • 10

4 Answers

13

The ROC curve is biased towards the positive class. The described situation, with a high AUC and low accuracy, can occur when your classifier achieves good performance on the positive class (high AUC) at the cost of a high false negative rate (or a low number of true negatives).

The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.

The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints at the actual performance of your classifier.


About ROC analysis

The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:

def classifier(observation, t):
    if score_function(observation) <= t:
        return "negative"
    else:
        return "positive"

The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
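As a minimal sketch of such a scoring function (the weights `w` and `b` are illustrative placeholders, not from the answer), a logistic score for a one-dimensional observation could look like this:

```python
import math

def score_function(x, w=2.0, b=-1.0):
    """Logistic score: estimated probability that x belongs to the
    positive class. The weight w and bias b are placeholders here;
    in practice they are learned during training."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The score lies in (0, 1) and grows with the affinity of the observation to the positive class.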

In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.

Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we refer to them as true positives (TP) or true negatives (TN), depending on the values of y_p and y_t; if y_p and y_t do not match, we have false positives (FP) or false negatives (FN).

We can apply this to the entire validation dataset and count the total numbers of TPs, TNs, FPs and FNs, as well as the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:

TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives

Note that the TPR is often referred to as the sensitivity, and the FPR is equivalent to 1-specificity.

In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:

accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
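These definitions translate directly into code (a minimal sketch assuming 0/1 labels; `confusion_rates` is my own helper, not from the answer):

```python
def confusion_rates(y_true, y_pred):
    """Count TP/FP/TN/FN and derive TPR, FPR and accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    acc = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, acc
```

For example, `confusion_rates([1,1,1,0,0,0], [1,1,0,1,0,0])` yields TPR=2/3, FPR=1/3 and accuracy 2/3.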

Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: plotting TPR(t) against FPR(t) yields the receiver operating characteristic (ROC) curve. Below are some sample ROC curves, plotted in Python using roc-utils*.

Exemplary ROC curves
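The threshold sweep described above can be sketched in plain Python (my own helper, not the roc-utils API):

```python
def roc_points(y_true, scores):
    """Trace the ROC curve: one (FPR, TPR) point per decision threshold t,
    classifying a sample as positive whenever its score exceeds t."""
    p = sum(y_true)              # number of positives
    n = len(y_true) - p          # number of negatives
    # Include a threshold below all scores to reach the (1, 1) extreme.
    thresholds = [min(scores) - 1] + sorted(set(scores))
    points = []
    for t in thresholds:
        tp = sum(1 for yt, s in zip(y_true, scores) if yt == 1 and s > t)
        fp = sum(1 for yt, s in zip(y_true, scores) if yt == 0 and s > t)
        points.append((fp / n, tp / p))
    return points
```

The lowest threshold yields the point (1, 1) and the highest yields (0, 0), matching the two extremes discussed below.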

Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
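Youden's index, for instance, picks the cutoff t* that maximizes J(t) = TPR(t) - FPR(t). A self-contained sketch (my own helper, not from the answer):

```python
def youden_optimal_threshold(y_true, scores):
    """Pick the cutoff t* that maximizes Youden's J = TPR(t) - FPR(t),
    classifying a sample as positive whenever its score is >= t."""
    p = sum(y_true)              # number of positives
    n = len(y_true) - p          # number of negatives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for yt, s in zip(y_true, scores) if yt == 1 and s >= t)
        fp = sum(1 for yt, s in zip(y_true, scores) if yt == 0 and s >= t)
        j = tp / p - fp / n
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Geometrically, J(t) is the vertical distance of the ROC point from the diagonal, so this picks the point farthest above the coin-flipping line.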

Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:

Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.

In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1 (see the left figure below). In the other extreme case where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal (see the right figure below).

Extreme ROC curves

Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.

The AUC, or area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes. It varies between 0 and 1. In the case of a perfect classifier, the AUC is 1. A classifier that assigns a random class label to input data would yield an AUC of 0.5.
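One way to compute the AUC without plotting anything uses the fact that it equals the probability that a randomly drawn positive sample scores higher than a randomly drawn negative one (a minimal sketch; `auc` is my own helper, not the roc-utils API):

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count half) -- equal to the area under the ROC curve."""
    pos = [s for yt, s in zip(y_true, scores) if yt == 1]
    neg = [s for yt, s in zip(y_true, scores) if yt == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separated score distribution gives 1.0, and fully overlapping scores give 0.5, matching the two extreme ROC curves above.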

* Disclaimer: I'm the author of roc-utils

normanius
  • 8,629
  • 7
  • 53
  • 83
  • 1
    And for even more elaborate answers on this have also a look [here](https://stats.stackexchange.com/questions/90659)! – normanius Feb 06 '18 at 22:54
  • In the binary case of e.g. ExtraTrees, is this still the case? Since the AUC would be the same for both the "positive" class and the "negative" class (wouldn't it?), then I would assume that (under the assumption of a balanced dataset) the AUC would give the same result as the accuracy – CutePoison Jan 29 '19 at 09:59
  • Shouldn't TPR be: `number of true positives / number of times the label was positive`? – snowneji Sep 04 '19 at 16:08
  • Correct me if I'm wrong, but: let's say a threshold of 0.5 cannot separate the two classes very well, but 0.7 does it perfectly. We would then have AUC=1 but (since most classifiers classify the class just with the highest "probability") you could end up with a low accuracy but a high AUC. If you change the classification threshold to 0.7 instead of 0.5, shouldn't we have a high accuracy as well (equal to one in this case)? – CutePoison Nov 21 '19 at 15:39
3

I guess you are misreading the correct class when calculating the ROC curve...
That would explain the low accuracy and the high (wrongly calculated) AUC.

It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).

http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html

Alvaro Silvino
  • 9,441
  • 12
  • 52
  • 80
2

Another simple explanation for this behaviour is that your model is actually very good - just its final threshold to make predictions binary is bad.

I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model produces continuous predictions for these four samples: 0.7, 0.75, 0.9 and 0.95.

We would consider this a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0; hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed, badly chosen threshold for these predictions, say 0.5 (which is what we sometimes force upon our model output), then all 4 samples would be predicted as class 1, which leads to an accuracy of 50%.
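This example can be checked numerically (a self-contained sketch; `accuracy_at` is my own helper):

```python
labels = [0, 0, 1, 1]
scores = [0.7, 0.75, 0.9, 0.95]

def accuracy_at(threshold):
    """Accuracy after binarizing the scores at the given cutoff."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# The default cutoff 0.5 labels every sample positive: accuracy 0.5.
# The well-chosen cutoff 0.8 separates the classes perfectly: accuracy 1.0.
```

Same model, same scores; only the cutoff changes the accuracy.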

Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.

Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.

0

If this happens every single time, maybe your model is not correct. Start by changing the kernel and try the model with new sets. Look at the confusion matrix every time and check the TN and TP areas. The model may be inadequate at detecting one of them.