
I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.

When I build a Logit model and use predict, it returns values between 0 and 1, as opposed to 0 or 1. I read in this related question that these are probabilities and that a threshold is needed: Python statsmodel.api logistic regression (Logit)

Now, I want to produce AUC numbers and I use roc_auc_score from sklearn (docs).

Here is when I start getting confused.

  1. When I pass the raw predicted values (probabilities) from my Logit model to roc_auc_score as the second argument y_score, I get a reasonable AUC of around 0.80. How does roc_auc_score know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set a threshold.
  2. When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 0.50. Why does this happen?

Here's some code:

from sklearn.metrics import roc_auc_score

# m1 is the statsmodels Logit model built earlier (e.g. Logit(y, X1))
m1_result = m1.fit(disp=False)

# Passing the raw predicted probabilities
roc_auc_score(y, m1_result.predict(X1))
# AUC: 0.80

# Thresholding the probabilities at 0.5 first
roc_auc_score(y, [1 if p >= 0.5 else 0 for p in m1_result.predict(X1)])
# AUC: 0.50

Why is this the case?


2 Answers


Your 2nd way of calculating the AUC is wrong; by definition, the AUC needs probabilities (or, more generally, scores), not hard 0/1 class predictions generated after thresholding, which is what you do in your second call. So, your AUC is 0.80.

You don't set a threshold yourself in AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
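To see this concretely, here is a minimal sketch with made-up labels and probabilities (not your data): passing the probabilities lets roc_auc_score sweep every distinct score as a threshold, while passing thresholded 0/1 predictions leaves it only one effective threshold, so the two AUCs differ.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, purely for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.4, 0.35, 0.6, 0.7, 0.8])

# Probabilities: every distinct score serves as a candidate threshold
print(roc_auc_score(y_true, y_prob))

# Hard 0/1 predictions: only a single effective threshold remains, so the AUC changes
print(roc_auc_score(y_true, (y_prob >= 0.5).astype(int)))

# roc_curve exposes the thresholds (and FPR/TPR pairs) behind the AUC
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(thresholds)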

It would be overkill to explain again here the rationale and details of AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea:

desertnaut

predict yields the estimated probability of the event according to your fitted model; that is, each element is the predicted probability that your model calculated for the corresponding observation.
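As a quick, self-contained sketch (with random, made-up data rather than the poster's), this just illustrates that Logit's predict returns fitted probabilities rather than classes:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))  # made-up predictors plus an intercept
y = rng.integers(0, 2, size=100)                # made-up binary outcome

res = sm.Logit(y, X).fit(disp=False)
print(res.predict(X)[:5])  # fitted probabilities strictly between 0 and 1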

The process behind building a ROC curve consists of selecting each predicted probability as a threshold, measuring its false positive and true positive rates and plotting these results as a line graph. The area below this curve is the AUC.

To visualize this, imagine you had the following data:

observation   observed_result   predicted_prob
1             0                 0.1
2             0                 0.5
3             1                 0.9

The function roc_auc_score will do the following:

  1. Use 0.1 as the threshold, so that all observations with predicted_prob ≤ 0.1 are classified as 0 and those with predicted_prob > 0.1 are classified as 1
  2. Use 0.5 as the threshold, so that all observations with predicted_prob ≤ 0.5 are classified as 0 and those with predicted_prob > 0.5 are classified as 1
  3. Use 0.9 as the threshold, so that all observations with predicted_prob ≤ 0.9 are classified as 0 and those with predicted_prob > 0.9 are classified as 1

Each of the three different thresholds (0.1, 0.5 and 0.9) will result in its own false positive and true positive rates. The false positive rates are plotted along the x-axis, while the true positive rates are plotted in the y-axis.
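If it helps to see this in code, here is a small sketch built on the three made-up rows above; scikit-learn's roc_curve returns the false and true positive rates for each threshold it sweeps (drop_intermediate=False keeps every candidate threshold, and the exact threshold list can vary slightly between versions):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

observed = np.array([0, 0, 1])           # observed_result from the table above
predicted = np.array([0.1, 0.5, 0.9])    # predicted_prob from the table above

# One FPR/TPR pair per candidate threshold (plus a starting point)
fpr, tpr, thresholds = roc_curve(observed, predicted, drop_intermediate=False)
print(thresholds, fpr, tpr)

# The single positive outranks both negatives, so the area under this curve is 1.0
print(roc_auc_score(observed, predicted))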

As you can guess, you need to test many thresholds to plot a smooth curve. If you threshold at 0.5 yourself and pass the resulting 0/1 predictions to roc_auc_score, you are only testing the false positive and true positive rates of that single threshold. This is incorrect and is also the reason roc_auc_score returns a lower AUC than before.

Instead of doing this, you may want to test the performance of a single threshold (i.e. 0.5) by calculating its corresponding accuracy, true positive rate or false positive rate.

For instance, imagine we set a threshold of 0.5 in the data above.

observation   observed_result   predicted_prob   predicted_class
1             0                 0.1              0
2             0                 0.5              0
3             1                 0.9              1

This is a silly example, but by using 0.5 as the cutoff value, we made a perfect prediction because the observed_result matches predicted_class in all cases.
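To make the earlier suggestion concrete, here is a sketch (again using the made-up table above) that evaluates the single 0.5 cutoff with accuracy, true positive rate and false positive rate instead of an AUC:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

observed = np.array([0, 0, 1])
predicted_prob = np.array([0.1, 0.5, 0.9])

predicted_class = (predicted_prob > 0.5).astype(int)  # cutoff of 0.5, as in the table

tn, fp, fn, tp = confusion_matrix(observed, predicted_class).ravel()
print(accuracy_score(observed, predicted_class))  # 1.0 here: every prediction matches
print(tp / (tp + fn))                             # true positive rate
print(fp / (fp + tn))                             # false positive rate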

Arturo Sbr