2

Initialize the list of lists:

data = [[1.0, 0.635165,0.0], [1.0, 0.766586,1.0], [1.0, 0.724564,1.0],
        [1.0, 0.766586,1.0],[1.0, 0.889199,1.0],[1.0, 0.966586,1.0],
        [1.0, 0.535165,0.0],[1.0, 0.55165,0.0],[1.0, 0.525165,0.0],
        [1.0, 0.5595165,0.0] ]

Create the Pandas DataFrame:

import pandas as pd

df = pd.DataFrame(data, columns=['y', 'prob', 'y_predict'])

Print data frame.

print(df)

For this dataset, I want to find:

  1. The confusion matrix, without using Sklearn
  2. NumPy arrays of TPR and FPR, without using Sklearn, for plotting the ROC curve.

How can I do this in Python?

Flavia Giammarino
Sahil Kamboj

4 Answers

9

You can calculate the false positive rate and true positive rate associated with different threshold levels as follows:

import numpy as np

def roc_curve(y_true, y_prob, thresholds):

    fpr = []
    tpr = []

    for threshold in thresholds:

        y_pred = np.where(y_prob >= threshold, 1, 0)

        fp = np.sum((y_pred == 1) & (y_true == 0))
        tp = np.sum((y_pred == 1) & (y_true == 1))

        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))

        fpr.append(fp / (fp + tn))
        tpr.append(tp / (tp + fn))

    return [fpr, tpr]
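
One caveat: in the data from the question every value of y is 1.0, so fp + tn is zero and the false positive rate is undefined. Below is a minimal sketch of one way to guard the divisions, assuming NaN is an acceptable placeholder for an undefined rate. The helper names safe_rate and roc_point and the example arrays are hypothetical and not part of the answer above.

```python
import numpy as np

def safe_rate(numerator, denominator):
    # Return numerator / denominator, or NaN when the denominator is zero
    return numerator / denominator if denominator > 0 else float("nan")

def roc_point(y_true, y_prob, threshold):
    # Classify as positive when the score reaches the threshold
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return safe_rate(fp, fp + tn), safe_rate(tp, tp + fn)

# Hypothetical labels and scores (not the data from the question)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.6, 0.4, 0.9])

fpr, tpr = roc_point(y_true, y_prob, threshold=0.5)

# All-positive labels, as in the question: the FPR comes back as NaN
fpr_all_pos, tpr_all_pos = roc_point(np.ones(4), y_prob, threshold=0.5)
```

Points with a NaN rate can then be dropped or reported before plotting, depending on what the caller prefers.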
Flavia Giammarino
0

Without the sklearn Python module:

  1. Confusion matrix without using Sklearn

    • You can use pandas_ml:

      from pandas_ml import ConfusionMatrix

    • You can implement the confusion-matrix formulas yourself.
  2. For the ROC curve, you can:

    • look at the Python MatLab example that solves this issue;
    • build your arrays with NumPy (np) and implement the formulas in your own code.

You can understand more if you take a look at these articles:

logistic-regression-using-numpy - python examples regression;

what-is-the-roc-curve - theory;

roc-curve-part-2-numerical-example - python practice;
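
The option of writing the confusion-matrix formula by hand could look roughly like the sketch below. It assumes binary labels coded as 0 and 1 and a layout with actual classes in rows and predicted classes in columns. The function name and the example labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    # Rows are actual classes (0, 1); columns are predicted classes (0, 1)
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    # Add 1 to the cell addressed by each (actual, predicted) pair
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical labels (not the data from the question)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# With this layout the flattened matrix reads tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
```

With the DataFrame from the question, the inputs would be df['y'] and df['y_predict'].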

  • 1. I just need the function that can give me the NumPy array of TPR & FPR separately. I know how to plot ROC. I can use numpy.trapz(tpr_array, fpr_array) for the auc_score, if I had the required arrays. – Sahil Kamboj Apr 20 '20 at 12:41
  • Sorry, I don't know a specific function for these issues. The input data for the TPR and FPR arrays gives the graph for ROC. " I just need the function that can give me the NumPy array of TPR & FPR separately." - so you don't have input data and you don't know the theory. – Cătălin George Feștilă Apr 20 '20 at 14:25
  • it's ok, I got it. Thanks – Sahil Kamboj Apr 20 '20 at 15:00
  • no problem, give your vote and rate the answers for each response, this will help users to understand your problem into an area of answers. – Cătălin George Feștilă Apr 20 '20 at 15:40
0
import numpy as np

def calculate_cm(predicted, actual):
  # Count each outcome from the binary predictions and the true labels
  fp = np.sum((predicted == 1) & (actual == 0))
  tp = np.sum((predicted == 1) & (actual == 1))

  fn = np.sum((predicted == 0) & (actual == 1))
  tn = np.sum((predicted == 0) & (actual == 0))
  return tp, fp, fn, tn

def calculate_recall(tp, fp, fn, tn):
  return (tp)/(tp + fn)

def calculate_fallout(tp, fp, fn, tn):
  return (fp)/(fp + tn)

def calculate_at_threshold(threshold, actual, predicted):
  p = np.where(predicted >= threshold, 1, 0)
  tp, fp, fn, tn = calculate_cm(p, actual)
  tpr = calculate_recall(tp, fp, fn, tn)
  fpr = calculate_fallout(tp, fp, fn, tn)
  return fpr, tpr 

def roc_curve(actual, predicted, thresholds):
  tpr = []
  fpr = []
  for threshold in thresholds:
    fpr_t, tpr_t = calculate_at_threshold(threshold, actual, predicted)
    # Append each rate to its own list
    fpr.append(fpr_t)
    tpr.append(tpr_t)
  return fpr, tpr
irvifa
0

This is a slightly faster version of Flavia Giammarino's answer which only uses NumPy arrays; I also added a few comments and provided alternative, more generic variable names:

import numpy as np

def roc_curve(probabilities, ground_truth, thresholds):

    # Initialize FPR & TPR arrays (float, whatever the dtype of thresholds)
    fpr = np.empty(len(thresholds))
    tpr = np.empty(len(thresholds))

    # Compute FPR & TPR
    for t in range(0, len(thresholds)):
        # Threshold the scores, then compare against the true labels
        y_pred = np.where(probabilities >= thresholds[t], 1, 0)
        fp = np.sum((y_pred == 1) & (ground_truth == 0))
        tp = np.sum((y_pred == 1) & (ground_truth == 1))
        fn = np.sum((y_pred == 0) & (ground_truth == 1))
        tn = np.sum((y_pred == 0) & (ground_truth == 0))
        fpr[t] = fp / (fp + tn)
        tpr[t] = tp / (tp + fn)

    return fpr, tpr

Thresholds can be easily generated with a function like NumPy's linspace:

np.linspace(start, end, n)

where [start, end] is the range of thresholds (endpoints included; use start = 0 and end = 1) and n is the number of thresholds. In my experience, n = 50 is a good trade-off between speed and accuracy, although n >= 100 yields smoother curves.
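
As a rough illustration of how the thresholds feed into the curve and the area under it, here is a self-contained sketch. It swaps the loop for NumPy broadcasting and computes the area with a hand-written trapezoidal rule, so it is an alternative formulation rather than the function above. The labels and scores are hypothetical, and both classes are assumed to be present.

```python
import numpy as np

# Hypothetical labels and scores (not the data from the question)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.6, 0.4, 0.9])

# 101 evenly spaced thresholds in [0, 1], endpoints included
thresholds = np.linspace(0, 1, 101)

# Compare every score with every threshold at once (one row per threshold)
y_pred = y_prob[np.newaxis, :] >= thresholds[:, np.newaxis]

# Fraction of positives / negatives predicted positive at each threshold
tpr = (y_pred & (y_true == 1)).sum(axis=1) / np.sum(y_true == 1)
fpr = (y_pred & (y_true == 0)).sum(axis=1) / np.sum(y_true == 0)

# Sort by increasing FPR (ties broken by TPR), then apply the trapezoidal rule
order = np.lexsort((tpr, fpr))
x, y = fpr[order], tpr[order]
auc = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
```

The broadcasted comparison builds an array of shape (n, number of samples), so for very large datasets the loop version may use less memory.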

Zelethil