What I need is to:
- Apply a logistic regression classifier
- Report the per-class ROC using the AUC.
- Use the estimated probabilities of the logistic regression to guide the construction of the ROC.
- Use 5-fold cross-validation for training the model.
For this, my approach was to use this really nice tutorial:
I kept the tutorial's idea and method and changed only how I obtain the raw data, which I load like this:
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
df.tail()
# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values
Then I simply run the code. With metrics such as accuracy or balanced_accuracy, everything works fine (as it does with many other metrics). My problem is that when I use the metric roc_auc, I get the error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."
This error has been discussed here1, here2, here3, and here4. However, I was not able to use any of the solutions or workarounds given there to solve my problem.
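One possible cause worth checking is the fold composition. The iris file lists its rows class by class, so an unshuffled KFold can produce test folds that contain a single class, which is exactly the situation the error message describes. The sketch below is only a diagnostic under stated assumptions: it uses sklearn.datasets.load_iris as a stand-in for the CSV (same 150 rows in the same class order, but with integer labels), and classes_per_test_fold is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: load_iris stands in for the UCI CSV; rows are ordered by class
# (50 setosa, then 50 versicolor, then 50 virginica).
X, y = load_iris(return_X_y=True)


def classes_per_test_fold(cv, X, y):
    """Hypothetical helper: number of distinct classes in each test fold."""
    return [len(np.unique(y[test_idx])) for _, test_idx in cv.split(X, y)]


# Unshuffled KFold cuts the sorted data into consecutive blocks of 30 rows.
plain = classes_per_test_fold(KFold(n_splits=5), X, y)

# Stratified, shuffled folds keep all three classes in every test fold.
stratified = classes_per_test_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=7), X, y
)

print(plain)       # [1, 2, 1, 2, 1]
print(stratified)  # [3, 3, 3, 3, 3]
```

If the first list contains a 1, the scorer receives a y_true with only one class for that fold, and ROC AUC cannot be computed there.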
My whole code is:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from sklearn.preprocessing import StandardScaler
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'qt')
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
df.tail()
# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values
#print(X)
#print(Y)
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
# evaluate each model in turn
results = []
names = []
scoring = 'roc_auc'
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
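For reference, the goal stated at the top (logistic regression, 5-fold cross-validation, per-class ROC AUC driven by the estimated probabilities) could be sketched roughly as follows. This is a minimal sketch, not the tutorial's method: it assumes sklearn.datasets.load_iris in place of the CSV, swaps in StratifiedKFold with shuffling, collects out-of-fold probabilities with cross_val_predict, and scores each class one-vs-rest. The names proba, y_bin, and per_class_auc are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import label_binarize

# Assumption: load_iris replaces the CSV download; labels are 0, 1, 2.
X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Stratified, shuffled 5-fold split so every fold sees all classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)

# Out-of-fold class probabilities: each row is predicted by a model
# that did not see that row during training.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")

# One-vs-rest indicator matrix, one column per class.
y_bin = label_binarize(y, classes=classes)

# Per-class AUC: class i against all other classes.
per_class_auc = {
    int(c): roc_auc_score(y_bin[:, i], proba[:, i])
    for i, c in enumerate(classes)
}

for c, auc in per_class_auc.items():
    print("class %d: AUC = %.3f" % (c, auc))
```

The same columns of y_bin and proba can be passed to sklearn.metrics.roc_curve to draw one ROC curve per class.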