
I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0, class_weight="balanced")

from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring=('accuracy', 'roc_auc'))

However, I get the following error.

ValueError: multiclass format is not supported

Please note that I selected roc_auc specifically because it supports both binary and multiclass classification, as mentioned in: https://scikit-learn.org/stable/modules/model_evaluation.html

I have a binary classification dataset too. Please let me know how to resolve this error.

I am happy to provide more details if needed.

desertnaut
EmJ

1 Answer


By default, roc_auc_score uses multi_class='raise', which raises an error for multiclass targets, so you need to set this parameter explicitly.

From the docs:

multi_class {'raise', 'ovr', 'ovo'}, default='raise'

Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.

'ovr':

Computes the AUC of each class against the rest [3] [4]. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.

'ovo':

Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.
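
To make the difference between the two settings concrete, here is a minimal sketch that calls roc_auc_score directly on a single held-out split. The train/test split, its test_size and the random_state values are assumptions chosen for illustration; they are not part of the original question.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]  # same two features as in the question
y = iris.target

# Illustrative split (assumption): stratified so every class appears in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0, class_weight="balanced")
clf.fit(X_train, y_train)

# For multiclass targets roc_auc_score expects class probabilities
# of shape (n_samples, n_classes), not hard predictions
proba = clf.predict_proba(X_test)

auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")  # one-vs-rest
auc_ovo = roc_auc_score(y_test, proba, multi_class="ovo")  # one-vs-one
print("ovr:", auc_ovr, "ovo:", auc_ovo)
```

Calling roc_auc_score(y_test, proba) without multi_class on this data raises a ValueError, which is the behavior the quoted documentation describes.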


Solution:

Use make_scorer (docs):

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

clf = RandomForestClassifier(random_state=0, class_weight="balanced")

# multi_class='ovo' is forwarded to roc_auc_score; needs_proba=True makes the
# scorer pass predict_proba output instead of hard class predictions.
# Note: scikit-learn 1.4 and later use response_method='predict_proba' instead of needs_proba.
myscore = make_scorer(roc_auc_score, multi_class='ovo', needs_proba=True)

cross_validate(clf, X, y, cv=10, scoring=myscore)
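
As an alternative sketch, scikit-learn (0.22 and later) also ships predefined scorer names such as 'roc_auc_ovr' and 'roc_auc_ovo', so a custom scorer may not be needed. The shuffled StratifiedKFold and the two-class subset of iris below are assumptions added for illustration, the latter standing in for the binary dataset mentioned in the question.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

iris = datasets.load_iris()
X = iris.data[:, :2]  # same two features as in the question
y = iris.target

clf = RandomForestClassifier(random_state=0, class_weight="balanced")

# Shuffled, stratified folds (assumption: the fold setup is illustrative)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Multiclass: use the predefined one-vs-rest / one-vs-one scorer names
multi_results = cross_validate(
    clf, X, y, cv=cv, scoring=("accuracy", "roc_auc_ovr", "roc_auc_ovo")
)
print("multiclass ovo AUC:", multi_results["test_roc_auc_ovo"].mean())

# Binary: keep only classes 0 and 1 as a stand-in binary dataset;
# here the plain 'roc_auc' scorer works without any multi_class setting
mask = y != 2
X_bin, y_bin = X[mask], y[mask]
binary_results = cross_validate(
    clf, X_bin, y_bin, cv=cv, scoring=("accuracy", "roc_auc")
)
print("binary AUC:", binary_results["test_roc_auc"].mean())
```

With several scorers, cross_validate returns one array per metric under keys of the form test_<scorer name>, one entry per fold.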

seralouk
  • This gives `AxisError: axis 1 is out of bounds for array of dimension 1` in `cross_validate`. You need to add `needs_proba=True` in the definition of `myscore`. Additionally, it's good practice to shuffle the data first. – desertnaut Mar 24 '20 at 11:38
  • @makis Thank you very much for the answer. However, I get the following error `TypeError: roc_auc_score() got an unexpected keyword argument 'multi_class'`. Is there a way to resolve this? :) – EmJ Mar 24 '20 at 13:52
  • @makis One further question. If I want to use this for binary classification, what should I change? Thank you :) – EmJ Mar 24 '20 at 14:05
  • In sklearn 0.22.2 the function `roc_auc_score` has this argument. Make sure you upgrade your package. See: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html – seralouk Mar 24 '20 at 15:30
  • @makis Thanks a lot. If I were to use your code for binary classification, is it correct to make the scorer without the `multi_class` parameter? i.e. `myscore = make_scorer(roc_auc_score, needs_proba=True)`. Looking forward to hearing from you :) – EmJ Mar 25 '20 at 12:46
  • @makis Thanks a lot. I followed your answer in this question: https://stackoverflow.com/questions/45641409/computing-scikit-learn-multiclass-roc-curve-with-cross-validation-cv and I got different results. Since there is limited space in the comments, I posted it as a different question: https://stackoverflow.com/questions/60849396/how-to-get-roc-auc-for-binary-classification-in-sklearn Please let me know your thoughts on this :) – EmJ Mar 25 '20 at 13:03
  • consider upvoting my answer. I am going to have a look at your newly posted question asap – seralouk Mar 25 '20 at 14:20