
The sigmoid function outputs a number between 0 and 1. Is this a probability, or is it merely a 'yes or no' depending on whether it's above or below 0.5?

Minimal example:

Cats vs dogs binary classification. 0 is cat, 1 is dog.

Can I perform the following interpretation of the sigmoid output values:

  • 0.9 - it's most certainly a dog
  • 0.52 - it's more likely to be a dog than a cat, but still quite unsure
  • 0.5 - completely undecided, could be either a cat or a dog
  • 0.48 - it's more likely to be a cat than a dog, but still quite unsure
  • 0.1 - it's most certainly a cat

Or would this be the right way to interpret the results:

  • 0.9 - it's a dog
  • 0.52 - it's a dog
  • 0.5 - completely undecided, could be either a cat or a dog
  • 0.48 - it's a cat
  • 0.1 - it's a cat

Note how in the first case we use the numeric value to express probability as well, while in the second case we ignore the probability interpretation entirely and collapse the answers to binary. Which is correct? Can you explain why?
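
To make the two readings concrete, here is how I would code each of them (a minimal sketch in Python; the 0.5 threshold and the cat/dog labels are just the ones from the example above):

```python
import math

def sigmoid(z: float) -> float:
    """Standard logistic sigmoid; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Reading 1: keep the numeric value as a graded degree of confidence.
def graded(p: float) -> str:
    if p >= 0.5:
        return f"dog with confidence {p:.2f}"
    return f"cat with confidence {1 - p:.2f}"

# Reading 2: collapse to a hard binary label at the 0.5 threshold.
def hard(p: float) -> str:
    return "dog" if p >= 0.5 else "cat"

for p in (0.9, 0.52, 0.5, 0.48, 0.1):
    print(f"{p}: graded -> {graded(p)}, hard -> {hard(p)}")
```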


Background context, feel free to skip this:

I've found a number of sources that suggest that yes, sigmoid output can be interpreted as probability:

  • Source yes 1 - (...) sigmoid(z) will yield a value (a probability) between 0 and 1.
  • Source yes 2 - The "output" must come from a function that satisfies the properties of a distribution function in order for us to interpret it as probabilities. (...) The "sigmoid function" satisfies these properties.
  • Source yes 3 - tf.sigmoid(logits) gives you the probabilities.

And a number of sources that suggest the contrary, that sigmoid output cannot be interpreted as probabilities:

  • Source no 1 - (...) the raw values cannot necessarily be interpreted as raw probabilities!
  • Source no 2 - Sigmoid (...) is not a probability distribution function
  • Source no (and also yes) 3 - the short answer is no, however, depending on the loss you use, it may be closer to truth than you may think.

(bonus questions, answer to win a car!) Why are there so many contradictory answers? How do these answers differ? I find it unlikely that it's just a lot of people being completely wrong about it - I suspect they're talking about different cases or starting from different fundamental assumptions. What's the difference that I'm missing?


I know I can just use a softmax. I also know that sigmoid can be used for non-exclusive multi-class classification (Source multi 1, Source multi 2, Source multi 3) - although even then it's unclear whether such multiple sigmoids output probabilities of various classes or again simply a 'yes or no', but for multiple classes. In my case though, I'm interested in exclusive two-class (binary) classification, and whether sigmoid can be used to determine its probabilities, or should two-class softmax be used.
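
As far as I can tell, part of my confusion is that for exactly two exclusive classes the two approaches coincide: a two-class softmax over logits (z0, z1) gives the same value as a sigmoid applied to the difference z1 - z0. A quick numerical check (plain Python; the logit values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z0, z1):
    # Two-class softmax: exp(z_i) / (exp(z0) + exp(z1)).
    e0, e1 = math.exp(z0), math.exp(z1)
    return e0 / (e0 + e1), e1 / (e0 + e1)

# P(class 1) from the two-class softmax equals sigmoid(z1 - z0),
# so the choice between the two is a matter of parameterisation.
for z0, z1 in [(0.0, 2.3), (-1.0, 1.0), (4.2, -0.7)]:
    _, p1 = softmax2(z0, z1)
    assert abs(p1 - sigmoid(z1 - z0)) < 1e-12
```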

Voy

2 Answers


A sigmoid function is not a probability density function (PDF), as it integrates to infinity. However, it is exactly the cumulative distribution function (CDF) of the logistic distribution.

Regarding your interpretation of the results: even though the sigmoid is not a PDF, its values lie in the interval (0, 1), so you can still interpret them as a confidence index. With that in mind, I would say that your first interpretation is the more appropriate one, although you are free to implement whichever classifier suits your purposes better.
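
You can check both halves of that statement numerically: the sigmoid's derivative is the logistic density, and it is the derivative, not the sigmoid itself, that integrates to one (a sketch in plain Python; the integration bounds and step count are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_pdf(x):
    # The density of the standard logistic distribution is the
    # derivative of the sigmoid: f(x) = sigmoid(x) * (1 - sigmoid(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid behaves like a CDF: monotone, ~0 at -inf, ~1 at +inf...
assert sigmoid(-20) < 1e-8 and sigmoid(20) > 1 - 1e-8

# ...while its derivative integrates to ~1, as a PDF must
# (trapezoid rule over a wide interval).
lo, hi, n = -20.0, 20.0, 4000
h = (hi - lo) / n
area = h * (sum(logistic_pdf(lo + i * h) for i in range(1, n))
            + 0.5 * (logistic_pdf(lo) + logistic_pdf(hi)))
assert abs(area - 1.0) < 1e-4
```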

edu_
  • Could you elaborate on why it is allowed to _'still interpret them as a confidence index'_, especially given that the sigmoid is not a PDF? What I'm trying to understand here is not only what to do, but why I should do it. Is it because _'it corresponds to the cumulative probability function of the logistic distribution'_? – Voy Nov 27 '19 at 17:38

I think the contradiction between your linked sources comes from a strict mathematical definition of probability versus an intuitive one. The intuitive reading, "an output closer to 1 is more likely to be correct", is the right intuition, but the number does not correspond directly to a probability. For example, we couldn't say that an output of 1 is twice as likely as 0.5 to be a dog.

There are problems like overfitting that make the purely mathematical probability viewpoint incorrect. However, since you have to pick one of the two options for your program, it makes sense either to interpret the result with the binary greater-or-less-than-0.5 rule, or to allow an adjustable margin of error (for example, treating 0.5 +/- x as undecided).
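
The adjustable-margin idea might look like this (a sketch; the margin of 0.1 is an arbitrary placeholder, not a recommendation):

```python
def classify(p: float, margin: float = 0.1) -> str:
    """Three-way decision on a sigmoid output p in (0, 1).

    Outputs within `margin` of 0.5 are reported as undecided instead of
    being forced into a hard cat/dog label.
    """
    if p >= 0.5 + margin:
        return "dog"
    if p <= 0.5 - margin:
        return "cat"
    return "undecided"

print(classify(0.9))   # dog
print(classify(0.52))  # undecided: inside the 0.5 +/- 0.1 band
print(classify(0.1))   # cat
```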

yo conway
  • That distinction between the semantic and the intuitive understanding is a very interesting observation. Intuition aside, the 0.5 +/- _x_ margin-of-error approach is what I'm currently using. However, I find it hard to confidently select the _x_ margin value. A 0.75 probability from a softmax would make me feel quite confident, whereas a 0.75 value from a sigmoid still leaves me questioning whether that's good enough. Are there any methods to find that confidence margin _x_ other than empirical trial and error? Say, a fraction of the standard deviation from 0.5 on the training data? – Voy Nov 27 '19 at 17:48
  • I don't know of a better way to figure out the margin of error than running your algorithm against a labelled test set to collect the output values. Then you can map any margin value "x" to the false-positive rate observed outside that margin. – yo conway Nov 29 '19 at 16:18
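
That test-set procedure might be sketched like this (the scores and labels below are made-up stand-ins for real model outputs on a held-out set; 1 = dog, 0 = cat):

```python
# Hypothetical (sigmoid_score, true_label) pairs from a held-out test set.
test_set = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1), (0.52, 0),
            (0.48, 1), (0.40, 0), (0.30, 0), (0.15, 0), (0.05, 0)]

def error_rate_outside_margin(pairs, margin):
    """Error rate among the predictions the margin rule actually makes."""
    decided = [(s, y) for s, y in pairs if abs(s - 0.5) > margin]
    if not decided:
        return None  # margin so wide that every case is 'undecided'
    wrong = sum(1 for s, y in decided if (s > 0.5) != (y == 1))
    return wrong / len(decided)

# Sweep candidate margins: a wider margin decides fewer cases but tends
# to make fewer mistakes on the cases it does decide.
for m in (0.0, 0.1, 0.2, 0.3):
    print(m, error_rate_outside_margin(test_set, m))
```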