
I have an Excel file that contains the predicted values and probabilities from my model. I need to plot a ROC curve for each class of this multiclass problem from that file, i.e. for Intent1, Intent2, Intent3 (there are about 70 intents in total).

Utterence   Intent_1    Conf_intent1 Intent_2   Conf_Intent2  ...so on 
Uttr 1      Intent1       0.86        Intent2       0.45         
Uttr2       Intent3       0.47        Intent1       0.76        
Uttr3       Intent1       0.70        Intent3       0.20         
Uttr4       Intent3       0.42        Intent2       0.67         
Uttr5       Intent1       0.70        Intent3       0.55             
Note: The probabilities are absolute scores, so they do not add up to 1 for a given utterance. The intent with the highest probability is the predicted one.

This is my code for which I am getting the error:

import pandas as pd 
import numpy as np 
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp

#reading the input file
df = pd.read_excel('C:\\test.xlsx')

#Converting the columns to array

predicted = df['Predicted'].to_numpy()
Score = df['Probability'].to_numpy()

labels=df['Predicted'].unique()
mcm = multilabel_confusion_matrix(actual, predicted, labels=labels)


predicted = label_binarize(predicted, classes=labels)
n_class = predicted.shape[0]
print(n_class)

print(type(predicted))

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_class):
    fpr[i], tpr[i], _ = roc_curve(predicted[:, i], Score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot of a ROC curve for a specific class
for i in range(n_class):
    plt.figure()
    plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

But I am getting this error:

File "roc.py", line 61, in <module>
    fpr[i], tpr[i], _ = roc_curve(predicted[:, i], Score[:, i])
IndexError: too many indices for array

Then I removed the [:, i] indexing from both predicted and Score, and got this error instead:

raise ValueError("{0} format is not supported".format(y_type))
ValueError: multilabel-indicator format is not supported

Can anyone help me on this?

think-maths

1 Answer

There are several changes you need to make in your code:

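  • First, from a statistical point of view: a ROC curve and its AUC are computed by comparing the predicted probability scores with the actual (ground-truth) labels. You are comparing the predicted probabilities with the predicted labels, which is not meaningful: the predicted label is derived from those very scores, so the curve says nothing about how well the model matches the truth.

  • Second, from a code point of view: n_class should be the number of classes, not the number of observations. After label_binarize the array has shape (n_samples, n_classes), so you should use n_class = predicted.shape[1] rather than predicted.shape[0]. Score must likewise be a 2-D array with one column of probabilities per class; a single 'Probability' column is 1-D, which is why Score[:, i] raises the IndexError.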
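
To make the two points concrete, here is a minimal sketch on made-up data (the labels and scores below are hypothetical and not taken from the asker's file). It assumes one score column per class, in the same order as the classes list, and shows that the second dimension of the binarized array is the class count.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# Hypothetical toy data: true labels plus one score column per class,
# with the columns in the same order as `classes`.
classes = ["Intent1", "Intent2", "Intent3"]
actual = np.array(["Intent1", "Intent3", "Intent1", "Intent2", "Intent3", "Intent2"])
scores = np.array([
    [0.86, 0.45, 0.10],
    [0.30, 0.20, 0.47],
    [0.70, 0.15, 0.20],
    [0.75, 0.67, 0.42],
    [0.35, 0.10, 0.55],
    [0.60, 0.50, 0.05],
])

# One-vs-rest indicator matrix built from the TRUE labels.
y_true = label_binarize(actual, classes=classes)
print(y_true.shape)  # (6, 3): 6 observations, 3 classes

n_class = y_true.shape[1]  # number of classes, not y_true.shape[0]

# One ROC curve (and AUC) per class: true indicator column vs. score column.
roc_auc = {}
for i in range(n_class):
    fpr, tpr, _ = roc_curve(y_true[:, i], scores[:, i])
    roc_auc[classes[i]] = auc(fpr, tpr)

print(roc_auc)
```

Note that label_binarize returns a single column when there are only two classes, so the per-class loop as written applies to three or more classes.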

I put this answer together, trying to stick to your code as much as possible:

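# Continues from the question's code (same imports, same df).
# True labels, not the predicted ones
actual = df['Actual'].to_numpy()
# One column of scores per class, shape (n_samples, n_classes)
Score = df[['Conf_intent1','Conf_intent2','Conf_intent3']].to_numpy()

# Note: unique() returns labels in order of first appearance; this order
# must match the column order of Score
labels=df['Actual'].unique()

# One-vs-rest indicator matrix, one column per class
actual = label_binarize(actual, classes=labels)
n_class = actual.shape[1]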


# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_class):
    fpr[i], tpr[i], _ = roc_curve(actual[:, i], Score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot of a ROC curve for a specific class
for i in range(n_class):
    plt.figure()
    plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
MaximeKan
  • Following the concept of ROC and what you said, I changed the output of my model as shown in the edited question, so every utterance now has each intent with its probability next to it. How can this be plotted for each individual class, i.e. each "Intent"? – think-maths Sep 25 '19 at 06:41
  • How does this not answer your question? This creates 3 ROC curves, one for each class – MaximeKan Sep 25 '19 at 11:47
  • My model is one-vs-all, and for every utterance it gives the probabilities of all the intents, as shown above. So to plot the ROC for each class, I have to iterate over all the columns that hold intents and their probabilities. In that case ```df['Actual'].unique()``` won't give the number of classes, but the number of columns with 'Intent' will, won't it? – think-maths Sep 25 '19 at 12:20
  • df["actual"].unique() will return an array of the labels. In this example, it will be ["Intent1", "Intent2", "Intent3"], and then you get one vs rest for each of these labels. I believe this is what you want – MaximeKan Sep 25 '19 at 13:24
  • Do you mean ```labels=df[["Intent_1","Intent_2","Intent_3....]]```? But then, in the for loop for plotting, how will they be mapped to their respective probabilities? For example, for Utterence1 the content of the Intent_1 column has its probability in Conf_Intent1, and so on for the other intents. I hope my question is clear. Thanks in advance – think-maths Sep 25 '19 at 14:17
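
The mapping asked about in the last comment could be sketched as follows. This is only an illustration under stated assumptions: the helper ranked_to_score_matrix is hypothetical, the column names are assumed to follow the pattern Intent_k / Conf_Intentk, and an intent that does not appear in a row is given a score of 0.0, which is a choice made here and not something the thread specifies. The sample rows are the first three utterances from the question's table.

```python
def ranked_to_score_matrix(rows, classes, n_ranks, missing=0.0):
    """Turn ranked (intent, confidence) pairs into one score column per class.

    rows    : list of dicts, e.g. from df.to_dict("records")
    classes : class names; the output columns follow this order
    n_ranks : number of Intent_k / Conf_Intentk column pairs (assumed naming)
    missing : score used when a class is absent from a row (an assumption)
    """
    col_of = {name: j for j, name in enumerate(classes)}
    matrix = []
    for row in rows:
        scores = [missing] * len(classes)
        for k in range(1, n_ranks + 1):
            intent = row[f"Intent_{k}"]
            scores[col_of[intent]] = row[f"Conf_Intent{k}"]
        matrix.append(scores)
    return matrix


# First three utterances from the question's sample table.
rows = [
    {"Intent_1": "Intent1", "Conf_Intent1": 0.86, "Intent_2": "Intent2", "Conf_Intent2": 0.45},
    {"Intent_1": "Intent3", "Conf_Intent1": 0.47, "Intent_2": "Intent1", "Conf_Intent2": 0.76},
    {"Intent_1": "Intent1", "Conf_Intent1": 0.70, "Intent_2": "Intent3", "Conf_Intent2": 0.20},
]
classes = ["Intent1", "Intent2", "Intent3"]

score_matrix = ranked_to_score_matrix(rows, classes, n_ranks=2)
for line in score_matrix:
    print(line)
```

Converting the result with np.asarray(score_matrix) gives a 2-D array that can play the role of Score in the answer's loop, provided the same classes list is passed to label_binarize so that column i of both arrays refers to the same intent.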