Controlling the threshold in Logistic Regression in Scikit Learn

Question

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even set the class_weight parameter to auto.

I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.

Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?

I did not find anything in the documentation page.

Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?

Well, since LR is a probabilistic classifier, that is, it returns probability of a class, it makes sense to use 0.5 as a threshold. — Artem Sobolev, Feb 25 '15 at 11:15

score 36 · Answer 1 · edited Nov 08 '18 at 01:57

Here is a little trick that I use: instead of calling model.predict(test_data), call model.predict_proba(test_data). Then sweep a range of threshold values to analyze their effect on the predictions:

import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Column 0 holds P(class 0), column 1 holds P(class 1)
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # Label an entry 1 when its probability exceeds the current threshold
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
    # Note: .as_matrix() was removed in pandas 1.0; use .values there
    test_accuracy = metrics.accuracy_score(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
                                           Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
                           Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1)))

Best!

I like this answer. What I am struggling to understand is how one would tie this into GridSearchCV. When I run GridSearchCV, I am finding the best model among many. Presumably, the default Logistic Regression threshold of 0.5 is being used internally, so how would I override this default threshold when scoring takes place to pick the best model? — demongolem, Sep 08 '20 at 21:06
@demongolem you can use a threshold-independent metric like roc_auc to find the best parameters through GridSearch, and then set the threshold manually after having identified the best parameters — Charlie, Apr 21 '23 at 12:52
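The two-step approach from the comment above could be sketched roughly as follows. This is only an illustration: the synthetic data, the `C` grid, and the threshold value of 0.3 are assumptions made for the example, not values from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (roughly 10% positives), purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 1: tune hyperparameters with a metric that does not depend on a threshold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Step 2: apply a manually chosen threshold to the refit best model.
threshold = 0.3  # hypothetical value; choose it from your own analysis
probs = search.best_estimator_.predict_proba(X_test)[:, 1]
custom_preds = (probs > threshold).astype(int)

# For comparison: predictions at the default 0.5 cutoff.
default_preds = search.predict(X_test)
print("positives at 0.5:", default_preds.sum(),
      "| positives at", threshold, ":", custom_preds.sum())
```

Because the cutoff is lowered after the search, the hyperparameter selection itself is unaffected by the threshold.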

score 24 · Answer 2 · edited May 23 '17 at 12:10

Logistic regression chooses the class that has the highest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds for the multiclass setting: again, it chooses the class with the highest probability (see e.g. Ng's lectures, the bottom lines).

Introducing a special threshold only affects the proportion of false positives to false negatives (and thus the precision/recall tradeoff); it is not a parameter of the LR model. See also the similar question.
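A quick way to check this claim yourself is to compare predict() against the probabilities from predict_proba(). The sketch below uses synthetic data and made-up variable names, so treat it as an illustration rather than a proof.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Binary case: predict() should agree with a 0.5 cutoff on P(Y=1).
X_bin, y_bin = make_classification(n_samples=500, n_features=8, random_state=0)
clf_bin = LogisticRegression(max_iter=1000).fit(X_bin, y_bin)
proba_bin = clf_bin.predict_proba(X_bin)
binary_match = np.array_equal(
    clf_bin.predict(X_bin), (proba_bin[:, 1] > 0.5).astype(int))

# Multiclass case: predict() should agree with the argmax over class probabilities.
X_multi, y_multi = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0)
clf_multi = LogisticRegression(max_iter=1000).fit(X_multi, y_multi)
proba_multi = clf_multi.predict_proba(X_multi)
argmax_labels = clf_multi.classes_[np.argmax(proba_multi, axis=1)]
multi_match = np.array_equal(clf_multi.predict(X_multi), argmax_labels)

print("binary agrees with 0.5 cutoff:", binary_match)
print("multiclass agrees with argmax:", multi_match)
```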

veg2020 · Accepted Answer · 2021-04-01T13:23:40.337

Yes, scikit-learn uses a threshold of 0.5 for binary classification. Building on some of the answers already posted, here are two options to check this:

One simple option is to extract the probability of each class with model.predict_proba(test_x) and the class predictions with model.predict(test_x), both of which appear in the code below. Then append the class predictions and their probabilities to your test dataframe as a check.
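As a rough sketch of this first option, the snippet below builds such a check dataframe. The data is synthetic, and the column names (`actual`, `predicted`, `prob_1`) and the `check_df` variable are invented for the example; `log`, `test_x`, and `test_y` follow the naming used in this answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for your own train/test split.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=1)
log = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# Put features, true labels, predicted labels, and P(class 1) side by side.
check_df = pd.DataFrame(
    test_x, columns=[f"x{i}" for i in range(test_x.shape[1])])
check_df["actual"] = test_y
check_df["predicted"] = log.predict(test_x)
check_df["prob_1"] = log.predict_proba(test_x)[:, 1]

# Rows where the predicted label disagrees with a 0.5 cutoff on prob_1.
cutoff_labels = (check_df["prob_1"] > 0.5).astype(int)
mismatches = check_df[check_df["predicted"] != cutoff_labels]

print(check_df[["actual", "predicted", "prob_1"]].head())
print("rows disagreeing with the 0.5 cutoff:", len(mismatches))
```

If the default threshold really is 0.5, `mismatches` should come back empty.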

As another option, one can graphically view precision vs. recall at various thresholds using the following code.

import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import precision_recall_curve

# Predict test_y values and probabilities based on the fitted logistic regression model
pred_y = log.predict(test_x)

# probs_y is a 2-D array: probability of being labeled 0 (first column) vs 1 (second column)
probs_y = log.predict_proba(test_x)

# Use the probability of being 1 (second column of probs_y)
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
pr_auc = metrics.auc(recall, precision)

# precision and recall have one more entry than thresholds, so drop the last one
plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[:-1], "b--", label="Precision")
plt.plot(thresholds, recall[:-1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0, 1])
plt.show()

Instantiate logistic regression in sklearn, make sure you have a test and train dataset partitioned and labeled as test_x, test_y, and run (fit) the logistic regression model on this data; the rest should follow from here. — veg2020, Mar 02 '20 at 22:42
You can save a bit of coding by using `sklearn.metrics.plot_precision_recall_curve`. — Yoav Vollansky, May 01 '20 at 18:54
Function plot_precision_recall_curve is deprecated in 1.0 and will be removed in 1.2. — JammingThebBits, Feb 16 '22 at 10:54

score 3 · Answer 4 · answered Nov 09 '22 at 10:57

We can use a wrapper as follows:

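from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

def custom_predict(X, threshold):
    # Column 1 of predict_proba holds the probability of the positive class
    probs = model.predict_proba(X)
    return (probs[:, 1] > threshold).astype(int)


new_preds = custom_predict(X=X, threshold=0.4)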
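The wrapper leaves open how to pick the threshold. One possible approach, sketched below under the assumption that F1 is the metric you care about, is to choose the cutoff that maximizes F1 on a held-out validation split. The data is synthetic and names such as `best_threshold` and `tuned_preds` are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; replace with your own train/validation split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs_val = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, probs_val)

# The final precision/recall pair has no matching threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best_threshold = thresholds[np.argmax(f1)]

# precision_recall_curve counts scores >= threshold as positive, so use >= here.
tuned_preds = (probs_val >= best_threshold).astype(int)
print("threshold maximizing F1 on the validation split:", best_threshold)
```

Picking the threshold on data the model was not trained on helps avoid an overly optimistic choice; any other metric could be substituted for F1.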

score 0 · Answer 5 · answered Jun 23 '23 at 17:45

If using @jazib jamil's and @Halee's solution with pandas version 0.23.0+, replace .as_matrix() with .values (documentation).

import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Column 0 holds P(class 0), column 1 holds P(class 1)
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # Label an entry 1 when its probability exceeds the current threshold
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.values.reshape(Y_test.values.size,1),
                                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.values.reshape(Y_test.values.size,1),
                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1)))
