
I've created a neural network with a sigmoid activation function in the last layer, so I get results between 0 and 1. I want to classify things in 2 classes, so I check "is the number > 0.5, then class 1 else class 0". All basic. However, I would like to say "the probability of it being in class 0 is x and in class 1 is y".

How can I do this?

  • Does a number like 0.73 tell me it's 73% sure to be in class 1? And then 1-0.73 = 0.27 so 27% in class 0?
  • When it's 0.27, does that mean it's 27% sure in class 0, 73% in class 1? Makes no sense.

Should I work with the 0.5 and look "how far away from the center is the number, and then that's the percentage"?

Or am I misunderstanding the result of the NN?

Tominator
  • Conventional neural networks are not probabilistic models. While interpreting a sigmoid output as a probability is pretty common, the truth is that there is really no connection between that value and the mathematical concept of probability. Taking 0.5 as split point is pretty common, but also leaving a mid range (e.g. 0.2-0.8) as "don't know". But these are all heuristics really. You can use a [ROC curve](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) to analyse what happens with different thresholds - that may at least give you "probabilities" in terms of TPR/FPR. – jdehesa Sep 12 '19 at 09:40
  • @jdehesa Yes, I kinda thought that already, that the model just learns "is it this or that", not "how much of this is it". I got confused by so many articles on the net saying "sigmoid can be interpreted as probability because it's between 0 and 1!" and then not answer my question above. I think I should abandon the idea, and just accept the classification as it is. Thanks for the link to more information though! – Tominator Sep 12 '19 at 09:52
  • You should note though that in the context of using cross-entropy loss for training, we do actually interpret the sigmoid output as p(y=1), so in a sense this probabilistic interpretation is "baked into" the network. – xdurch0 Sep 12 '19 at 13:05

2 Answers


As pointed out by Teja, the short answer is no. However, depending on the loss you use, it may be closer to the truth than you might think.

Imagine you try to train your network to differentiate numbers into two arbitrary categories, beautiful and ugly. Say your input numbers are either 0 or 1, and 0s have a 0.2 probability of being labelled ugly, whereas 1s have a 0.6 probability of being ugly.

Imagine that your neural network takes 0s and 1s as inputs, passes them through some layers, and ends in a sigmoid (or two-way softmax) function. If your loss is binary cross-entropy, then the optimal solution for your network is to output 0.2 when it sees a 0 as input and 0.6 when it sees a 1 as input (this is a property of the cross-entropy, which is minimized when you output the true probabilities of each label). Therefore, you can interpret these numbers as probabilities.

Of course, real world examples are not that easy and are generally deterministic so the interpretation is a little bit tricky. However, I believe that it is not entirely false to think of your results as probabilities as long as you use the cross-entropy as a loss.
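The cross-entropy property this answer relies on can be checked numerically. A minimal sketch in pure Python (the numbers come from the example above; the grid search is just an illustration, not part of the original answer): the expected binary cross-entropy, when the true label is 1 with probability p = 0.2, is minimized exactly when the network outputs 0.2.

```python
import math

# True probability that a 0-input is labelled "ugly" (from the example above)
p = 0.2

def expected_bce(q, p):
    """Expected binary cross-entropy when the network outputs q
    and the true label is 1 with probability p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Scan a grid of candidate outputs and keep the one with the lowest expected loss
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=lambda q: expected_bce(q, p))
print(best)  # prints 0.2 -- the minimizer is the true label probability
```

The same check with p = 0.6 yields 0.6, which is why a network trained with this loss ends up outputting the true label probabilities.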

I'm sorry, this answer is not black or white, but reality is sometimes complex ;)

Joseph Budin

> Does a number like 0.73 tell me it's 73% sure to be in class 1? And then 1-0.73 = 0.27 so 27% in class 0?

The answer is no. When we use the sigmoid function, the results are not guaranteed to sum to 1: the sum of the class outputs may be less than 1, or in some cases greater than 1.

By contrast, when we use the softmax function, the sum of all the outputs is always 1.
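This difference is easy to demonstrate. A small sketch in pure Python (the logits are made-up values for illustration): applying a sigmoid to each score independently gives values that need not sum to 1, while softmax normalizes them so they always do.

```python
import math

def sigmoid(x):
    # Squashes a single score into (0, 1), independently of the other scores
    return 1 / (1 + math.exp(-x))

def softmax(xs):
    # Normalizes all scores jointly so the outputs sum to 1
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]  # hypothetical raw scores for three classes

sig = [sigmoid(x) for x in logits]
print(sum(sig))   # roughly 1.77 -- not a probability distribution

soft = softmax(logits)
print(sum(soft))  # 1.0 (up to floating point) -- a valid distribution
```

Note that, as the comment below points out, the single-sigmoid binary case is special: with one output q interpreted as p(y=1), the pair (1-q, q) sums to 1 by construction.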

Teja
  • This is not true for simple binary classification, where you only have a single output that is in the range [0, 1] and which gives p(y=1). Then p(y=0) = 1 - p(y=1) and the probabilities sum to 1 "by design". – xdurch0 Sep 12 '19 at 13:04