41

What exactly does the LogisticRegression.predict_proba function return?

In my example I get a result like this:

array([
    [4.65761066e-03, 9.95342389e-01],
    [9.75851270e-01, 2.41487300e-02],
    [9.99983374e-01, 1.66258341e-05]
])

From other calculations using the sigmoid function, I know that the second column contains the probabilities. The documentation says that the first column is n_samples, but that can't be right, because my samples are reviews, which are texts and not numbers. The documentation also says that the second column is n_classes. That can't be right either, since I only have two classes (namely, +1 and -1), and the function is supposed to calculate the probability that a sample belongs to a class, not the classes themselves.

What is the first column really, and why is it there?

cottontail
Zelphir Kaltstahl

2 Answers

77
4.65761066e-03 + 9.95342389e-01 = 1
9.75851270e-01 + 2.41487300e-02 = 1
9.99983374e-01 + 1.66258341e-05 = 1

The first column is the probability that the entry has the -1 label and the second column is the probability that the entry has the +1 label. Note that classes are ordered as they are in self.classes_.
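To see the column order for yourself, a minimal sketch along these lines may help. The toy data is made up purely for illustration (it is not the asker's review data); the point is only that classes_ gives the column order and that each row sums to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric feature, labels -1 and +1.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

print(model.classes_)     # the column order of proba
print(proba.shape)        # (n_samples, n_classes)
print(proba.sum(axis=1))  # each row sums to 1, up to floating-point error
```

Here column i of proba holds the probability of class model.classes_[i], so with labels -1 and +1 the first column belongs to -1.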

If you would like to get the predicted probabilities for the positive label only, you can use logistic_model.predict_proba(data)[:,1]. For the example above, this yields [9.95342389e-01, 2.41487300e-02, 1.66258341e-05].
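As a rough sketch of the same idea, the column can be looked up through classes_ rather than hard-coded as 1. The toy data and variable names below are assumptions for illustration. The comparison with the sigmoid relies on the model being a binary LogisticRegression with default settings, where the positive-class probability should match the sigmoid of decision_function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric feature, labels -1 and +1.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Find the column that belongs to the +1 class instead of assuming index 1.
pos_col = list(model.classes_).index(1)
p_pos = model.predict_proba(X)[:, pos_col]

# Assumption: binary model with default settings, so the positive-class
# probability is the sigmoid of the raw decision score.
scores = model.decision_function(X)
p_sigmoid = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(p_pos, p_sigmoid))
```

Looking the index up this way keeps the code correct if the labels are encoded differently (for example 0/1 or strings), as long as the positive label passed to index() is changed to match.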

iulian
  • I totally didn't see that! Thanks for the quick clarification. I now wonder more than before what the documentation is talking about. – Zelphir Kaltstahl Apr 17 '16 at 20:29
  • 3
    The documentation says the following: returns the probability of the sample for each class in the model. @Zelphir: you saw in the docs: [n_samples, n_classes]. This refers to the shape of the output: it returns a matrix where the rows are the samples and the columns are the classes (-1, 1). As Iulian said: for every row you get a predicted probability of the class being -1 and a probability of the class being 1. – Sander van den Oord Apr 26 '16 at 12:49
  • 9
    How do we check the order of the classes? I mean how do you know that the first column is the probability of the class of -1? – Reihan_amn Oct 16 '18 at 20:39
  • Is there a way to determine the probability score for the sample from the probability for classes? – akalanka Feb 28 '19 at 19:16
  • 2
    @Reihan_amn If you read the pydoc, or if you take a look at the source code, of predict_proba(), you can read : `Returns p : array of shape (n_samples, n_classes) [..] The class probabilities of the input samples. The order of the classes corresponds to that in the attribute 'classes_'.` – Whole Brain Nov 24 '20 at 14:01
0

As iulian explained, each row of predict_proba()'s result holds the probabilities that the observation in that row belongs to each class (with the classes ordered as they are in lr.classes_).

In fact, it's also closely tied to predict(): for each row, predict() returns the class with the highest probability. So for any LogisticRegression (and most other classifiers), the following is True.

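from sklearn.linear_model import LogisticRegression

# X: feature matrix, y: class labels (assumed to be defined already)
lr = LogisticRegression().fit(X, y)
highest_probability_classes = lr.predict_proba(X).argmax(axis=1)
all(lr.predict(X) == lr.classes_[highest_probability_classes])     # True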
cottontail