I have three binary classification models, and I have gotten as far as the following point in trying to assemble them into a final comparative ROC plot.

import pandas as pd
import numpy as np
import sklearn.metrics as metrics

y_test = ...  # a numpy array containing the ground-truth test labels
dfo    = ...  # a pd.DataFrame containing the per-model prediction scores

# For each model column, keep the (fpr, tpr) arrays returned by roc_curve
dfroc = dfo[['SVM',
             'RF',
             'NN']].apply(lambda y_pred: metrics.roc_curve(y_test[:-1], y_pred[:-1])[0:2],
                          axis=0, result_type='reduce')
print(dfroc)

# AUC from each model's (fpr, tpr) pair
dfroc_auc = dfroc.apply(lambda x: metrics.auc(x[0], x[1]))
print(dfroc_auc)

This outputs the following (both dfroc and dfroc_auc are of type pandas.core.series.Series):

SVM     ([0.0, 0.016666666666666666, 1.0], [0.0, 0.923...
RF      ([0.0, 0.058333333333333334, 1.0], [0.0, 0.769...
NN      ([0.0, 0.06666666666666667, 1.0], [0.0, 1.0, 1...
dtype: object

SVM     0.953205
RF      0.855449
NN      0.966667
dtype: float64

To be able to plot them as a comparative ROC, I'd need to convert these into a dfroc pd.DataFrame with the following pivoted structure ... how can this pivoting be done?

      model   fpr       tpr
1     SVM     0.0       0.0
2     SVM     0.01667   0.923
3     SVM     1.0       ...
4     RF      0.0       0.0
5     RF      0.05833   0.769
6     RF      1.0       ...
7     NN      ...       ...
And then for the plotting, following the directions from How to plot ROC curve in Python, it would be something like:

import matplotlib.pyplot as plt

plt.title('Receiver Operating Characteristic')
# One curve per model, labelled with its AUC
for model, grp in dfroc.groupby('model'):
    plt.plot(grp['fpr'], grp['tpr'], label='%s (AUC = %0.2f)' % (model, dfroc_auc[model]))
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # chance diagonal
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

1 Answer

Not the ideal structure to work with, but assuming you have something as follows:

s = pd.Series({'SVC':([0.0, 0.016, 1.0], [0.0, 0.923, 0.5], [0.3, 0.4, 0.9]),
               'RF': ([0.0, 0.058, 1.0], [0.0, 0.923, 0.2], [0.5, 0.3, 0.9]),
               'NN': ([0.0, 0.06,  1.0], [0.0, 0.13, 0.4], [0.2, 0.4, 0.9])})

You could define a function to compute the TPR and FPR and return a dataframe with the specified structure:

def tpr_fpr(g):
    model, cm = g
    # Stack the group's lists into a 2D array, treated as a confusion matrix
    cm = np.stack(cm.values)
    diag = np.diag(cm)
    FP = cm.sum(0) - diag  # column sums minus the diagonal
    FN = cm.sum(1) - diag  # row sums minus the diagonal
    TP = diag
    TN = cm.sum() - (FP + FN + TP)
    TPR = TP / (TP + FN)  # sensitivity / recall
    FPR = FP / (FP + TN)  # fall-out
    return pd.DataFrame({'model': model,
                         'TPR': TPR,
                         'FPR': FPR})

Then group by the first level of the index, and apply the above function to each group:

out = pd.concat([tpr_fpr(g) for g in s.explode().groupby(level=0)])

print(out)

  model       TPR       FPR
0    NN  0.000000  0.098522
1    NN  0.245283  0.179688
2    NN  0.600000  0.880503
0    RF  0.000000  0.177117
1    RF  0.821906  0.129804
2    RF  0.529412  0.550206
0   SVC  0.000000  0.099239
1   SVC  0.648630  0.159021
2   SVC  0.562500  0.615006
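
As a side note, if your Series instead holds the (fpr, tpr) arrays that sklearn.metrics.roc_curve returns (as in the question), the long format can be built directly; a minimal sketch, assuming y_test and dfo as defined in the question (roc_long is just an illustrative helper):

# Minimal sketch: build the long format straight from roc_curve's output
def roc_long(model):
    fpr, tpr, _ = metrics.roc_curve(y_test, dfo[model])
    return pd.DataFrame({'model': model, 'fpr': fpr, 'tpr': tpr})

out = pd.concat([roc_long(m) for m in ['SVM', 'RF', 'NN']], ignore_index=True)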
  • thanks for the great answer, however why should we compute the FPR and TPR manually? I showed how to use apply to compute those using the `sklearn.metrics` module ... – SkyWalker Jun 23 '20 at 08:35
  • Well, performance-wise there won't be much of a difference. In both cases you need groupby + apply. You could easily adapt the function to use metrics @sky – yatu Jun 23 '20 at 08:42
  • I just had something similar from the past and adapted it :D @sky – yatu Jun 23 '20 at 08:44