I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit model and use predict, it returns values between 0 and 1 rather than 0 or 1. I then read this answer saying that these are probabilities and that we need a threshold: Python statsmodel.api logistic regression (Logit)
Now I want to produce AUC numbers, and I use roc_auc_score from sklearn (docs).
Here is where I start getting confused.
- When I pass the raw predicted values (probabilities) from my Logit model into roc_auc_score as the second argument y_score, I get a reasonable AUC of around 80%. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set a threshold.
- When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 50%. Why would this happen?
Here's some code:
m1_result = m1.fit(disp=False)

roc_auc_score(y, m1_result.predict(X1))
# AUC: 0.80

roc_auc_score(y, [1 if p >= 0.5 else 0 for p in m1_result.predict(X1)])
# AUC: 0.50
Why is this the case?
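For what it's worth, I can reproduce the same pattern on purely synthetic data (made-up probabilities, not my actual model), so it doesn't seem specific to statsmodels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and probabilities that are informative but noisy
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
probs = y * 0.3 + rng.uniform(0.0, 0.7, size=1000)  # higher scores for y=1, with overlap

# AUC on the raw probabilities uses the full ranking of the scores
auc_probs = roc_auc_score(y, probs)

# AUC on the thresholded 0/1 values collapses the ranking to one split point
auc_binary = roc_auc_score(y, (probs >= 0.5).astype(int))

print(auc_probs, auc_binary)  # the thresholded version comes out lower
```

The thresholded version is consistently lower here too, which makes me suspect the drop to ~50% in my case is about throwing away ranking information rather than a bug in my model.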