26

I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score. The problem is that sometimes the test data contains only 0s and no 1s!

I tried using this example, but with different numbers:

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)

And I get this exception:

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Is there any workaround that can make it work in such cases?

  • The cause of this may be incorrectly using cross-validation when each fold is not representative of the greater sample population. – Kermit Dec 29 '20 at 21:39

5 Answers

23

You could use try-except to prevent the error:

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
try:
    roc_auc_score(y_true, y_scores)
except ValueError:
    pass  # only one class in y_true: skip this fold (or record a default score)

You could also set the roc_auc_score to zero when only one class is present. However, I wouldn't do this. I guess your test data is highly imbalanced, so I would suggest using stratified k-fold instead, so that you at least have both classes present in every fold.
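A minimal sketch of that suggestion using scikit-learn's StratifiedKFold; the dataset and classifier here are made up purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Stratification preserves the class ratio in every test fold,
    # so both classes are present and roc_auc_score is always defined
    probas = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], probas))

print(np.mean(scores))
```

Because stratification keeps the class proportions in each fold, the ValueError from the question cannot occur here (as long as the full dataset contains both classes at least n_splits times).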

Dat Tran
  • why would you suggest against this? I am interested in using roc_auc_score as a metric for a CNN and if my batch sizes are on the smaller side the unbalanced nature of my data comes out. Not sure how to use Kfold on a data generator unfortunately – MikeDoho Dec 17 '18 at 12:30
  • There are many cases when you need a value, e.g. batch training or evaluation. I do not know the theoretically correct answer, but 0.75 should be a reasonable returned value. – Dmitry Konovalov Mar 26 '19 at 03:27
3

As the error notes, if a class is not present in the ground truth of a batch,

ROC AUC score is not defined in that case.

I'm against either throwing an exception (about what? This is the expected behaviour) or returning another metric (e.g. accuracy) in its place. The metric is not broken per se.

I don't feel like solving a data imbalance "issue" with a metric "fix". It would probably be better to use another sampling strategy, if possible, or simply to join multiple batches until they satisfy the class population requirement.
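One way to sketch the "join multiple batches" idea: accumulate labels and scores across batches and only compute the AUC once both classes have been seen. The helper name and data below are illustrative, not from any library:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_over_batches(batches):
    """Pool (y_true, y_score) batches and score once both classes appear."""
    all_true, all_scores = [], []
    for y_true, y_score in batches:
        all_true.append(np.asarray(y_true))
        all_scores.append(np.asarray(y_score))
    y_true = np.concatenate(all_true)
    y_score = np.concatenate(all_scores)
    if len(np.unique(y_true)) < 2:
        return None  # still undefined: only one class seen overall
    return roc_auc_score(y_true, y_score)

# A single-class batch alone would raise ValueError,
# but pooled with a later batch the score is well defined
batches = [([0, 0, 0, 0], [0.9, 0.1, 0.2, 0.3]),
           ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.4])]
print(auc_over_batches(batches))
```

This keeps the metric untouched and pushes the fix to where the problem actually is: the composition of the evaluation sample.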

Diego Ferri
1

I am facing the same problem now, and using try-except does not solve my issue. I developed the code below to deal with it.

import pandas as pd
import numpy as np

class KFold(object):

    def __init__(self, folds, random_state=None):
        self.folds = folds
        self.random_state = random_state

    def split(self, x, y):
        assert len(x) == len(y), 'x and y should have the same length'

        x_, y_ = pd.DataFrame(x), pd.DataFrame(y)

        # shuffle y and reorder x the same way
        y_ = y_.sample(frac=1, random_state=self.random_state)
        x_ = x_.loc[y_.index]

        # split the shuffled indices by class so every fold sees both classes
        # (mask against the shuffled y_, not the original y)
        labels = y_.iloc[:, 0]
        event_index = list(y_[labels == 1].index)
        non_event_index = list(y_[labels == 0].index)

        assert len(event_index) >= self.folds, 'number of folds should not exceed the number of event rows'
        assert len(non_event_index) >= self.folds, 'number of folds should not exceed the number of non-event rows'

        indexes = []

        # distribute the non-event indices across the folds
        step = int(np.ceil(len(non_event_index) / self.folds))
        start, end = 0, step
        while start < len(non_event_index):
            train_fold = set(non_event_index[start:end])
            valid_fold = set(k for k in non_event_index if k not in train_fold)
            indexes.append([train_fold, valid_fold])
            start, end = end, min(step + end, len(non_event_index))

        # distribute the event indices across the same folds
        step = int(np.ceil(len(event_index) / self.folds))
        start, end, i = 0, step, 0
        while start < len(event_index):
            train_fold = set(event_index[start:end])
            valid_fold = set(k for k in event_index if k not in train_fold)
            indexes[i][0] = list(indexes[i][0].union(train_fold))
            indexes[i][1] = list(indexes[i][1].union(valid_fold))
            indexes[i] = tuple(indexes[i])
            start, end, i = end, min(step + end, len(event_index)), i + 1

        return indexes

I just wrote this code and have not tested it exhaustively; it was tested only for binary classes. I hope it is useful anyway.

0

You can increase the batch size, e.g. from 32 to 64, or you can use StratifiedKFold or StratifiedShuffleSplit. If the error still occurs, try shuffling your data, e.g. in your DataLoader.
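A small sketch of the StratifiedShuffleSplit option; the synthetic data here is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([1] * 10 + [0] * 90)  # heavily imbalanced: 10% positives

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # each test split preserves the 10% positive rate, so both
    # classes are present and roc_auc_score stays defined
    print(len(test_idx), int(y[test_idx].sum()))
```

Unlike plain shuffling, stratified splitting guarantees the class ratio in every split, which is what prevents the single-class ValueError.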

A.A.
-6

Simply changing one of the 0s to a 1 makes it work:

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 1, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)

As the error message suggests, there is only one class in y_true (all zeros); you need to provide at least two classes in y_true.

Shark Deng