
I'm trying to make predictions using Python's statsmodels.api to run a logistic regression on a binary outcome. I'm using Logit as per the tutorials. When I predict on a test dataset, the output for each record is a decimal between 0 and 1. Shouldn't it be giving me zeros and ones? Or do I have to convert these using a round function or something similar?

Excuse the noobiness of this question. I am just starting my journey.

Josef
Karim Lameer

2 Answers

5

The predicted values are probabilities given the explanatory variables; more precisely, each is the probability of observing 1.

To get a 0/1 prediction, you need to pick a threshold, for example 0.5, and assign 1 to the probabilities above the threshold and 0 to the rest.

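With numpy this would be, for example:

threshold = 0.5  # cutoff on the predicted probability of observing 1
predicted = results.predict(x_for_prediction)  # probabilities in [0, 1]
predicted_choice = (predicted > threshold).astype(int)  # 0/1 labels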
Josef
  • Hi, thanks for your response to this. Is 0.5 the optimum threshold, or is there some way to work this out? – Karim Lameer Oct 23 '14 at 19:58
  • If you need a 0/1 point decision or classification, then the threshold will depend on your loss function. Whenever we pick a 0 or 1 prediction, we will make mistakes with some probability. If the loss from a mistake is symmetric, that is, picking 1 when the true value is 0 has the same "cost" as picking 0 when the true value is 1, then 0.5 is the optimal threshold. If the loss is asymmetric, then we should shift the threshold to minimize the prediction loss. – Josef Oct 23 '14 at 20:06
  • Hi, thanks again for your response. How do I calculate a loss function? Will Logit do that for me, or do I have to write my own? Is this something to do with the report that gets generated when you call the fit() method? If you could direct me to any online resources I would be really grateful. Karim – Karim Lameer Oct 23 '14 at 21:27
  • I am late to the conversation, but I just wanted to say that the proper way seems to be to draw samples from a Binomial(predicted) distribution, as described here: http://stats.stackexchange.com/questions/46523/how-to-simulate-artificial-data-for-logistic-regression – legaultmarc Mar 10 '15 at 21:59
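The loss-based threshold discussed in the comments above can be sketched as follows. This assumes correct decisions cost nothing and each kind of mistake has a fixed cost; under those assumptions, predicting 1 has the lower expected loss whenever p > c_fp / (c_fp + c_fn). The function names and the cost values are hypothetical, chosen only for illustration.

```python
def optimal_threshold(cost_false_positive, cost_false_negative):
    """Threshold on P(y = 1) that minimizes expected loss.

    Assumes correct decisions cost 0. Predicting 1 costs (1 - p) * c_fp in
    expectation, predicting 0 costs p * c_fn, so predict 1 when
    p > c_fp / (c_fp + c_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(probabilities, threshold):
    """Assign 1 to probabilities above the threshold, 0 otherwise."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = [0.10, 0.30, 0.55, 0.80]  # hypothetical predicted probabilities

# Symmetric loss: both mistakes cost the same, so the threshold is 0.5.
symmetric = optimal_threshold(1.0, 1.0)

# Asymmetric loss: missing a true 1 is four times as costly as a false alarm,
# so the threshold drops and more records are classified as 1.
asymmetric = optimal_threshold(1.0, 4.0)

print(symmetric, classify(probs, symmetric))
print(asymmetric, classify(probs, asymmetric))
```

Logit itself does not choose this threshold; the costs have to come from the application.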
0

If the response lies in the unit interval and is interpreted as a probability, then besides loss considerations, another perspective that may help is to treat it as a Binomial outcome, that is, as a count rather than a single Bernoulli trial. In particular, does your problem have a number of trials associated with each case? If so, the logistic regression could be re-expressed with a Binomial (count) response, where the integer count is the rounded expected value, obtained as the product of the probability and the number of trials.
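As a small illustration of that last step, assuming each record did come with a known number of trials (the `probs` and `trials` values below are made up), the rounded expected count is just the probability times the number of trials:

```python
# Hypothetical predicted probabilities and per-record numbers of trials.
probs = [0.12, 0.42, 0.83]
trials = [20, 10, 5]

# Expected Binomial count for each record: probability * number of trials.
expected_counts = [p * n for p, n in zip(probs, trials)]

# Rounded to the nearest integer to get a count-valued prediction.
rounded_counts = [round(e) for e in expected_counts]

print(rounded_counts)
```

With a single trial per record this reduces to rounding the probability, which is the same as thresholding at 0.5 except when the probability is exactly 0.5.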