The sigmoid function outputs a number between 0 and 1. Is this a probability, or is it merely a 'yes or no' depending on whether it's above or below 0.5?
Minimal example:
Cats vs dogs binary classification. 0 is cat, 1 is dog.
Can I interpret the sigmoid output values as follows:
- 0.9 - it's most certainly a dog
- 0.52 - it's more likely to be a dog than a cat, but still quite unsure
- 0.5 - completely undecided, could be either a cat or a dog
- 0.48 - it's more likely to be a cat than a dog, but still quite unsure
- 0.1 - it's most certainly a cat
Or would this be the right way to interpret the results:
- 0.9 - it's a dog
- 0.52 - it's a dog
- 0.5 - completely undecided, could be either a cat or a dog
- 0.48 - it's a cat
- 0.1 - it's a cat
Note how in the first case we use the numeric value to express a degree of confidence (a probability), while in the second case we ignore the probability interpretation entirely and collapse the answers to binary. Which interpretation is correct, and why?
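To make the two readings concrete, here is a minimal sketch (plain Python, with made-up logit values) applying both interpretations to the same sigmoid outputs:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw logits from a cats-vs-dogs model (0 = cat, 1 = dog).
logits = [2.2, 0.08, 0.0, -0.08, -2.2]

for z in logits:
    p = sigmoid(z)
    # Interpretation 1: keep the value as a graded score, read as P(dog).
    graded = f"P(dog) = {p:.2f}"
    # Interpretation 2: collapse to a hard yes/no at the 0.5 threshold.
    hard = "dog" if p >= 0.5 else "cat"
    print(f"logit {z:+.2f} -> {graded}, hard label: {hard}")
```

The logit values here are purely illustrative; they're chosen so the outputs land near the 0.9 / 0.52 / 0.5 / 0.48 / 0.1 values in the lists above.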
Background context, feel free to skip this:
I've found a number of sources suggesting that yes, the sigmoid output can be interpreted as a probability:
- Source yes 1 - (...) sigmoid(z) will yield a value (a probability) between 0 and 1.
- Source yes 2 - The "output" must come from a function that satisfies the properties of a distribution function in order for us to interpret it as probabilities. (...) The "sigmoid function" satisfies these properties.
- Source yes 3 - `tf.sigmoid(logits)` gives you the probabilities.
And a number of sources that suggest the contrary, that the sigmoid output cannot be interpreted as a probability:
- Source no 1 - (...) the raw values cannot necessarily be interpreted as raw probabilities!
- Source no 2 - Sigmoid (...) is not a probability distribution function
- Source no (and also yes) 3 - the short answer is no, however, depending on the loss you use, it may be closer to truth than you may think.
(bonus questions, answer to win a car!) Why are there so many contradicting answers? How do these answers differ? I find it unlikely that it's simply a lot of people being completely wrong about it - I suspect they're talking about different cases or starting from different fundamental assumptions. What's the difference that I'm missing?
I know I could just use a softmax. I also know that sigmoids can be used for non-exclusive multi-class classification (Source multi 1, Source multi 2, Source multi 3) - although even then it's unclear whether such multiple sigmoids output probabilities for the various classes or, again, simply a 'yes or no' per class. In my case, though, I'm interested in exclusive two-class (binary) classification, and in whether a sigmoid can be used to determine its probabilities, or whether a two-class softmax should be used instead.
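For what it's worth, for the exclusive two-class case, the sigmoid and a two-class softmax are mathematically the same thing: softmax over the logits `[0, z]` yields exactly `sigmoid(z)` as the probability of class 1, since e^z / (1 + e^z) = 1 / (1 + e^-z). A quick check (plain Python, illustrative logit values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

for z in [-4.0, -0.5, 0.0, 1.2, 3.0]:
    p_sigmoid = sigmoid(z)
    # Two-class softmax over [0, z]: second entry is P(class 1).
    p_softmax = softmax([0.0, z])[1]
    assert abs(p_sigmoid - p_softmax) < 1e-12
```

So whatever interpretation applies to the two-class softmax output applies equally to the sigmoid output - the choice between them can't by itself be what the sources disagree about.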