Using ROC AUC score with Logistic Regression and Iris Dataset

Question

What I need to do is:

Apply a logistic regression classifier.
Report the per-class ROC using the AUC.
Use the estimated probabilities of the logistic regression to guide the construction of the ROC.
Use 5-fold cross-validation to train the model.

For this, my approach was to use this really nice tutorial:

Following its idea and method, I only changed how the raw data is loaded, which I now do like this:

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

Then I simply run the code. If I score with metrics such as accuracy or balanced_accuracy, everything works fine (as it does with many other metrics). My problem is that when I use the roc_auc metric I get this error:

"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."

This error has been discussed here1, here2, here3, and here4. However, I was not able to use any of the solutions or workarounds provided there to solve my problem.

My whole code is:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from sklearn.preprocessing import StandardScaler
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'qt')
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

#print(X)
#print(Y)


seed = 7

# prepare models
models = []
models.append(('LR', LogisticRegression()))

# evaluate each model in turn
results = []
names = []
scoring = 'roc_auc'
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)



# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

It looks to me that one of the classes only has true positive cases. There is no way to plot an roc curve as it does not make sense in that case. Have you tried using [stratified k-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)? That MIGHT help. — razdi, May 02 '19 at 00:53

Venkatachalam · Accepted Answer · 2019-05-03T11:17:52.883

The iris dataset is ordered by class. Hence, when you split without shuffling, a test fold can end up containing only one class.

One simple solution is to use the shuffle parameter:

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

Even then, roc_auc does not support the multi-class format directly (the iris dataset has three classes).

See this link for more information on how to use roc_auc in the multi-class case.
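As a rough illustration of the per-class idea, here is a minimal sketch that computes a one-vs-rest ROC AUC for each class from cross-validated probabilities. Several things here are my assumptions and not part of the original code: the data comes from sklearn.datasets.load_iris (integer labels 0, 1, 2) instead of the UCI CSV, the folds are stratified and shuffled, max_iter is raised to avoid convergence warnings, and the out-of-fold predictions are pooled before scoring.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Assumption: bundled copy of iris with integer labels 0, 1, 2
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# Out-of-fold probability estimates, shape (150, 3);
# columns follow the sorted class labels
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")

# One-vs-rest: treat each class in turn as positive, all others as negative
per_class_auc = {}
for column, label in enumerate(np.unique(y)):
    per_class_auc[int(label)] = roc_auc_score(y == label, proba[:, column])

for label, auc in per_class_auc.items():
    print("class %d: AUC = %.3f" % (label, auc))
```

Pooling the out-of-fold probabilities gives one AUC per class; scoring each fold separately and averaging is an equally reasonable choice and may give slightly different numbers.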

score 1 · Answer 2 · answered May 03 '19 at 10:16

For classification tasks, stratified k-fold cross-validation is usually preferable, because it preserves the class balance in the train and test folds.

In scikit-learn's cross_val_score, the default cross-validation behaviour depends on the task. The documentation says:

cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 3-fold cross validation,

integer, to specify the number of folds in a (Stratified)KFold,

CV splitter,

An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
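The rule quoted above can be checked with check_cv, the public helper in sklearn.model_selection that, to my understanding, cross_val_score relies on to resolve the cv argument. This is only a sketch; loading the labels through load_iris is my assumption. Note also that the default of 3 folds in the quote reflects the scikit-learn release current at the time, and later releases changed that default.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Assumption: bundled copy of iris, multiclass integer labels 0, 1, 2
_, y = load_iris(return_X_y=True)

# Integer cv + classifier + multiclass y resolves to StratifiedKFold
cv_classifier = check_cv(5, y, classifier=True)

# The same integer without the classifier flag resolves to plain KFold
cv_other = check_cv(5, y, classifier=False)

print(type(cv_classifier).__name__, type(cv_other).__name__)
```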

Now, the iris dataset is a set of 150 samples ordered by class (Iris setosa, Iris versicolor and Iris virginica, 50 samples each). A plain KFold iterator with 5 folds and no shuffling cuts the data into consecutive blocks of 30 samples, so in the last split, for example, the first 120 samples form the training set and the last 30 samples form the test set. Those last 30 samples all belong to the single class Iris virginica.
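The effect of the class ordering on the folds can be inspected directly. The sketch below lists which labels land in each test fold for an unshuffled KFold and for StratifiedKFold. It assumes the copy of iris bundled with scikit-learn (load_iris), where the integer labels 0, 1, 2 stand for setosa, versicolor and virginica in the same sorted order; classes_per_test_fold is a hypothetical helper name of my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: bundled iris, 150 rows sorted by class (50 x 0, 50 x 1, 50 x 2)
X, y = load_iris(return_X_y=True)

def classes_per_test_fold(splitter):
    """Return the sorted unique labels found in each test fold."""
    return [np.unique(y[test_idx]).tolist()
            for _, test_idx in splitter.split(X, y)]

# Consecutive blocks of 30 rows: three of the five folds hold a single class
plain = classes_per_test_fold(KFold(n_splits=5))
print(plain)       # [[0], [0, 1], [1], [1, 2], [2]]

# Stratification puts every class into every test fold
stratified = classes_per_test_fold(StratifiedKFold(n_splits=5))
print(stratified)  # [[0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]]
```

Any test fold that contains a single label is enough to trigger the "Only one class present in y_true" error when the score is a ROC AUC.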

So if you have no specific reason to use KFold, you can do this:

cv_results = model_selection.cross_val_score(model, X, Y, cv=5, scoring=scoring)

But now comes the issue of scoring. You are using 'roc_auc', which scikit-learn defines only for binary classification tasks. So either choose a different metric in place of roc_auc, or specify which class you want to treat as positive and treat all other classes as negative.
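Both routes can be sketched in a few lines. The first option binarizes the labels so that the plain 'roc_auc' scorer applies; the choice of label 2 (virginica) as the positive class is arbitrary and mine. The second option keeps all three classes and uses the 'roc_auc_ovr' scorer, which as far as I know is only available in scikit-learn 0.22 and later. Loading the data with load_iris and raising max_iter are also my assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: bundled copy of iris with integer labels 0, 1, 2
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Option 1: one class as positive (label 2 here), everything else negative
y_binary = (y == 2).astype(int)
binary_scores = cross_val_score(model, X, y_binary, cv=5, scoring="roc_auc")

# Option 2: keep three classes, use the one-vs-rest multiclass scorer
# (assumes scikit-learn >= 0.22)
ovr_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc_ovr")

print("binary : %f (%f)" % (binary_scores.mean(), binary_scores.std()))
print("ovr    : %f (%f)" % (ovr_scores.mean(), ovr_scores.std()))
```

Because cv=5 is an integer and the estimator is a classifier, both calls use stratified folds, so every test fold contains positives and negatives.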
