Iris dataset - Plotting ROC curve for feature ranking / feature selection and interpreting it

Question

I've been referring to an article on feature selection and need help in understanding how an ROC curve has been plotted. Dataset used: Iris

One of the feature selection approaches mentioned in the article is "Visual ways to rank features".

The example below plots the ROC curve of various features.

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.metrics import auc
import numpy as np

# loading dataset
data = load_iris()
X, y = data.data, data.target

y_ = y == 2

plt.figure(figsize=(13,7))
for col in range(X.shape[1]):
    tpr,fpr = [],[]
    for threshold in np.linspace(min(X[:,col]),max(X[:,col]),100):
        detP = X[:,col] < threshold
        tpr.append(sum(detP & y_)/sum(y_))# TP/P, aka recall
        fpr.append(sum(detP & (~y_))/sum((~y_)))# FP/N

    if auc(fpr,tpr) < .5:
        aux = tpr
        tpr = fpr
        fpr = aux
    plt.plot(fpr,tpr,label=data.feature_names[col] + ', auc = '\
                           + str(np.round(auc(fpr,tpr),decimals=3)))

plt.title('ROC curve - Iris features')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

I want to understand this bit:

for threshold in np.linspace(min(X[:,col]),max(X[:,col]),100):
    detP = X[:,col] < threshold
    tpr.append(sum(detP & y_)/sum(y_)) # TP/P, aka recall
    fpr.append(sum(detP & (~y_))/sum((~y_)))# FP/N

How can one calculate the True Positive Rate (TPR) and False Positive Rate (FPR) by checking whether the values of a feature are below a threshold which has been calculated by dividing the range (Max-Min) of the feature in 100 equidistant points?

Here is the resultant ROC curve

Not an answer, but this is NOT a ROC curve. A ROC curve visits **all** thresholds. Besides not being a ROC curve, a linspace array will be a terrible choice for many real-world applications where the data is not close to a uniform distribution, or maybe even a Gaussian. — Calimo, Jul 26 '20 at 07:29
Link to the article : https://towardsdatascience.com/feature-selection-techniques-for-classification-and-python-tips-for-their-application-10 — , Jul 26 '20 at 07:33
But is the question: "how can a ROC curve have 100 equidistant thresholds " or "how is the TPR / FPR calculated"? — Calimo, Jul 26 '20 at 07:37
Article title: Feature selection techniques for classification and Python tips for their application — , Jul 26 '20 at 07:38

Calimo · Answer 1 · 2020-07-26T09:16:50.103

Let's start with "how can one calculate [a ROC curve] with [a set of] threshold which has been calculated by dividing the range (Max-Min) of the feature in 100 equidistant points?"

One can't!

By definition, a ROC curve shows how TPR and FPR vary across every possible threshold. Typically one uses the data itself to establish this set, and takes every unique data point as a threshold.
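A minimal sketch of that definition, using every unique value as a threshold. The helper name roc_points and the data are made up for illustration, and the sketch assumes that higher values indicate the positive class (the code in the question uses the opposite direction):

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays with one point per distinct threshold.

    Assumes a sample is predicted positive when its score is >= the threshold.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Every unique value is a threshold, highest first; +inf adds the (0, 0) corner.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [], []
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append((predicted_pos & y_true).sum() / n_pos)   # TP / P
        fpr.append((predicted_pos & ~y_true).sum() / n_neg)  # FP / N
    return np.array(fpr), np.array(tpr)

# Made-up example data, for illustration only.
y_demo = np.array([0, 0, 1, 0, 1, 1], dtype=bool)
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
fpr_demo, tpr_demo = roc_points(y_demo, scores_demo)
```

With six distinct values this gives seven points, running from (0, 0) to (1, 1); no threshold placed between two neighbouring values could add a new point.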

Limiting it to 100 equally spaced thresholds will give an approximation of the ROC curve at best. It might be a decent approximation if the data are probabilities. In many real-world applications, where the data are not uniformly distributed, or even Gaussian, this will be a terrible approximation.
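To illustrate how much an equidistant grid can miss, here is a rough sketch on synthetic, heavily skewed values. The data, the labels, and the helper distinct_roc_points are all invented for the demonstration:

```python
import numpy as np

def distinct_roc_points(y_true, scores, thresholds):
    """Count the distinct (FP, TP) pairs, i.e. distinct ROC points, visited."""
    points = set()
    for t in thresholds:
        predicted_pos = scores >= t
        tp = int((predicted_pos & y_true).sum())
        fp = int((predicted_pos & ~y_true).sum())
        points.add((fp, tp))
    return len(points)

# Synthetic skewed values (illustrative only): most of them are small.
scores = np.exp(np.linspace(0, 10, 500))
y_true = np.arange(500) % 3 == 0  # arbitrary labels for the demo

grid = np.linspace(scores.min(), scores.max(), 100)  # 100 equidistant thresholds
exact = np.unique(scores)                            # every distinct value

n_grid = distinct_roc_points(y_true, scores, grid)
n_exact = distinct_roc_points(y_true, scores, exact)
# Number of samples that the very first grid step jumps over at once.
below_second = int((scores < grid[1]).sum())
print(n_grid, n_exact, below_second)
```

Here the grid can visit at most 100 points while the data define 500, and more than half of the samples already fall below the second grid threshold, so a large stretch of the curve is collapsed into a single segment.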

Just don't do it!

Instead, use a dedicated function from a well-reviewed package such as sklearn:

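from sklearn.metrics import roc_curve
# y_ = (y == 2) is the binary target from the question; roc_curve returns three arrays
fpr, tpr, thresholds = roc_curve(y_, X[:,col])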
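As a rough sketch of how that call could replace the hand-rolled loop for all four Iris features, assuming the same binary target as in the question (class 2 against the rest); the dictionary aucs is just a name chosen here:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import auc, roc_curve

data = load_iris()
X, y = data.data, data.target
y_ = (y == 2).astype(int)  # binary target: class 2 vs. the rest, as in the question

aucs = {}
for col, name in enumerate(data.feature_names):
    # The thresholds are derived from the feature values themselves.
    fpr, tpr, thresholds = roc_curve(y_, X[:, col])
    aucs[name] = auc(fpr, tpr)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

A feature whose AUC comes out below 0.5 is one where lower values point to the positive class; negating that feature before calling roc_curve is one way to handle the direction, instead of swapping the tpr and fpr lists.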

To plot it, see for instance the answers to "How to plot ROC curve in Python".

Now for the second question: how are the TPR and FPR calculated from thresholds? Again, this is by definition: the TPR, or True Positive Rate, is the fraction of actual positives that are correctly identified, and the FPR, or False Positive Rate, is the fraction of actual negatives that are incorrectly flagged as positive. I'll refer to the corresponding Wikipedia article here, which explains it in more detail than can be covered here.
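As a small worked example of those definitions at a single threshold (the values and the helper rates_at_threshold are made up, and a sample is assumed to be predicted positive when its value is at or above the threshold):

```python
def rates_at_threshold(labels, values, threshold):
    """Return (TPR, FPR), predicting positive when value >= threshold."""
    tp = fp = fn = tn = 0
    for is_positive, value in zip(labels, values):
        predicted_positive = value >= threshold
        if predicted_positive and is_positive:
            tp += 1  # actual positive, correctly identified
        elif predicted_positive:
            fp += 1  # actual negative, wrongly flagged
        elif is_positive:
            fn += 1  # actual positive, missed
        else:
            tn += 1  # actual negative, correctly rejected
    return tp / (tp + fn), fp / (fp + tn)

# Made-up data: three positives and four negatives.
labels = [True, True, True, False, False, False, False]
values = [6.3, 5.8, 7.1, 5.0, 6.0, 4.9, 5.5]
tpr, fpr = rates_at_threshold(labels, values, threshold=5.9)
print(tpr, fpr)
```

At threshold 5.9, two of the three positives are caught (TPR = 2/3) and one of the four negatives is flagged (FPR = 1/4). Repeating this for every threshold and plotting the pairs gives the ROC curve.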

Can you point me towards a resource that plots an ROC curve from thresholds in a comprehensive manner? — , Jul 26 '20 at 08:58
Sure, I added a link to a related question in my answer: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python — Calimo, Jul 26 '20 at 09:17
