
I am trying to find the optimal threshold T of X to predict Y. I would normally use Youden's J in this setting, but when the threshold is a lower bound (in the case where Y varies inversely with X), the classic implementation does not seem to hold.

The following post has some partial answers (the first answer produces better results), but according to the comments the method is not reliable, and no paper is cited: Roc curve and cut off point. Python

def cutoff_youdens_j(fpr, tpr, thresholds):
    j_scores = tpr - fpr  # J = sensitivity (=tpr) + specificity (=1-fpr) - 1
    j_ordered = sorted(zip(j_scores, thresholds))
    return j_ordered[-1][1]  # threshold with the highest J

import numpy as np
from sklearn.metrics import roc_curve

X = np.arange(1, 10)
# Y is an example of a binary dependent variable that varies inversely to the predictor X
Y = X < 5

fpr, tpr, thresholds = roc_curve(Y, X)
T = cutoff_youdens_j(fpr, tpr, thresholds)
print(T) 
# OUTPUT: 10

The expected output would be 5, but I get 10.
Are there better methods for optimal threshold selection, and is there a paper demonstrating one? It would also be useful to know whether the resulting threshold is a lower or an upper bound.

EDIT: A possibility would be to negate X beforehand and then negate the resulting T back:

X = np.arange(1, 10)
Y = X < 5
X = -X
fpr, tpr, thresholds = roc_curve(Y, X)
T = cutoff_youdens_j(fpr, tpr, thresholds)
T = -T
print(T)
# OUTPUT: 4

This works, but the direction of the association has to be determined beforehand. Are there any other methods that work with both positive and negative associations between X and Y?
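
One possibility along these lines would be to infer the direction from the AUC itself and only negate X when the association is inverse. Below is a rough sketch (the helper name `cutoff_youdens_j_auto` is made up for illustration, it is not from any library):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def cutoff_youdens_j_auto(y, x):
    # Hypothetical helper: AUC < 0.5 means the positive class tends
    # to have lower x values, i.e. the association is inverse
    inverse = roc_auc_score(y, x) < 0.5
    scores = -x if inverse else x
    fpr, tpr, thresholds = roc_curve(y, scores)
    t = thresholds[np.argmax(tpr - fpr)]  # threshold maximising Youden's J
    # map the threshold back to the original scale of x
    return (-t if inverse else t), inverse

X = np.arange(1, 10)
Y = X < 5
T, inverse = cutoff_youdens_j_auto(Y, X)
print(T, inverse)  # T == 4, inverse == True

When `inverse` is true, T acts as an upper bound (predict Y for X <= T); otherwise it is a lower bound (predict Y for X >= T).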

MonsieurWave
  • There are many optimal threshold selection methods, as a Google Scholar search with 1.9 million results shows: https://scholar.google.ch/scholar?q=optimal+threshold+selection Which one is better for you is a whole research question in itself, and very much out of scope on this programming site. – Calimo May 24 '19 at 14:59

1 Answer


Your problem is that the positive class has the lower X values. Sklearn assumes higher values for the positive class; otherwise the ROC curve gets inverted, here with an AUC of 0.0:

from sklearn.metrics import roc_auc_score
print(roc_auc_score(Y, X))
# OUTPUT: 0.0

ROC analysis comes from the field of signal detection, and it critically depends on the definition of a positive signal, i.e. the direction of the comparison. Some libraries can detect that automatically for you, others don't, but in the end it always has to be done.
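
As a quick check (an addition of mine, using the fact that negating the scores mirrors the ROC curve), flipping the sign of X restores the curve:

print(roc_auc_score(Y, -X))
# OUTPUT: 1.0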

So the rest is correct: the "best" threshold in this case is one of the corners of the curve.

Just make sure your positive class is set properly, and you're good to go:

Y = X > 5
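
For completeness, here is a small sketch of my own (using np.argmax over the J scores, which is equivalent to the question's helper) showing that with the positive class on the high side, Youden's J recovers a sensible cut-off:

import numpy as np
from sklearn.metrics import roc_curve

X = np.arange(1, 10)
Y = X > 5  # the positive class now has the higher X values

fpr, tpr, thresholds = roc_curve(Y, X)
print(thresholds[np.argmax(tpr - fpr)])
# best J at threshold 6: predict the positive class for X >= 6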
Calimo
  • The `Y = X < 5` was just meant as an example of a binary dependent variable that varies inversely with the predictor X. Since in most cases one does not define this variable oneself, should one invert Y via `Y = abs(1-Y)` if an inverse relation with X is suspected? – MonsieurWave May 23 '19 at 20:33
  • Oh I see. But my answer stands: your AUC is 0 and the best threshold is 10 according to the Youden definition. I would invert X instead (`X = -X`); otherwise you are messing with the definition of a positive and switching all the rates, etc. – Calimo May 23 '19 at 21:35
  • Yeah, I have tried this, in which case one gets `-4` as output, but it still feels a bit weird, as one has to determine the direction of the correlation beforehand. – MonsieurWave May 24 '19 at 12:02
  • You get -T now, that's true, so you'll need to multiply it by -1 again to get T. And yes, in ROC analysis the direction of the correlation matters, and you have to determine it beforehand. Some libraries do that automatically for you, but they still do it, always. – Calimo May 24 '19 at 14:52