
When computing cross-entropy with a sigmoid activation function, there is a difference between

  1. loss1 = -tf.reduce_sum(p*tf.log(q), 1)
  2. loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1)

But they are the same with a softmax activation function.

Here is the sample code:

import tensorflow as tf

sess2 = tf.InteractiveSession()
p = tf.placeholder(tf.float32, shape=[None, 5])        # labels
logit_q = tf.placeholder(tf.float32, shape=[None, 5])  # raw logits
q = tf.nn.sigmoid(logit_q)                             # predicted probabilities
sess2.run(tf.global_variables_initializer())

feed_dict = {p: [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]],
             logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]}
loss1 = -tf.reduce_sum(p * tf.log(q), 1).eval(feed_dict)
loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q), 1).eval(feed_dict)

print(p.eval(feed_dict), "\n", q.eval(feed_dict))
print("\n", loss1, "\n", loss2)
D.S.H.J

2 Answers


You're confusing the cross-entropy for binary and multi-class problems.

Multi-class cross-entropy

The formula that you use is correct and it directly corresponds to tf.nn.softmax_cross_entropy_with_logits:

-tf.reduce_sum(p * tf.log(q), axis=1)

p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:

p = tf.placeholder(tf.float32, shape=[None, 2])
logit_q = tf.placeholder(tf.float32, shape=[None, 2])
q = tf.nn.softmax(logit_q)

feed_dict = {
  p: [[0, 1],
      [1, 0],
      [1, 0]],
  logit_q: [[0.2, 0.8],
            [0.7, 0.3],
            [0.5, 0.5]]
}

prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)
prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)
print(prob1.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]
print(prob2.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]

Note that q is computed with tf.nn.softmax, i.e. it outputs a probability distribution. So it's still the multi-class cross-entropy formula, only for N = 2.

Binary cross-entropy

This time the correct formula is

p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

Though mathematically it's a special case of the multi-class formula, the meaning of p and q is different. In the simplest case, each of p and q is a single number, corresponding to the probability of class A.

Important: don't get confused by the common p * -tf.log(q) part and the sum. Previously p was a one-hot vector; now it's a number, zero or one. The same goes for q: it was a probability distribution, now it's a single number (a probability).
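
To make this concrete, here is a minimal scalar sketch (the numbers are made up for illustration; they don't come from the question):

import numpy as np

# one label p and one predicted probability q
p, q = 1.0, 0.7
print(-(p * np.log(q) + (1 - p) * np.log(1 - q)))  # ~0.357: only the p * -log(q) term survives, since p == 1

p, q = 0.0, 0.7
print(-(p * np.log(q) + (1 - p) * np.log(1 - q)))  # ~1.204: only the (1 - p) * -log(1 - q) term survives, since p == 0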

If p is a vector, each individual component is considered an independent binary classification. See this answer, which outlines the difference between softmax and sigmoid functions in TensorFlow. So the definition p = [0, 0, 0, 1, 0] doesn't mean a one-hot vector, but 5 different features, 4 of which are off and 1 is on. The definition q = [0.2, 0.2, 0.2, 0.2, 0.2] means that each of the 5 features is on with 20% probability.

This explains the use of the sigmoid function before the cross-entropy: its goal is to squash the logit to the [0, 1] interval.
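
As a quick side check (a small sketch reusing the logit row [0.2, 0.2, 0.2, 0.2, 0.2] from the question), sigmoid squashes each logit independently into (0, 1), so the row does not sum to 1, while softmax turns the same row into a proper distribution:

import tensorflow as tf

logits = tf.constant([[0.2, 0.2, 0.2, 0.2, 0.2]])
with tf.Session() as sess:
    sig, soft = sess.run([tf.nn.sigmoid(logits), tf.nn.softmax(logits)])
print(sig, sig.sum())    # each entry ~0.55, the row sums to ~2.75 -- not a distribution
print(soft, soft.sum())  # each entry 0.2, the row sums to 1 -- a distribution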

The formula above still holds for multiple independent features, and that's exactly what tf.nn.sigmoid_cross_entropy_with_logits computes:

p = tf.placeholder(tf.float32, shape=[None, 5])
logit_q = tf.placeholder(tf.float32, shape=[None, 5])
q = tf.nn.sigmoid(logit_q)

feed_dict = {
  p: [[0, 0, 0, 1, 0],
      [1, 0, 0, 0, 0]],
  logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2],
            [0.3, 0.3, 0.2, 0.1, 0.1]]
}

prob1 = -p * tf.log(q)
prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1-p) * -tf.log(1-tf.sigmoid(logit_q))
prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
print(prob1.eval(feed_dict))
print(prob2.eval(feed_dict))
print(prob3.eval(feed_dict))
print(prob4.eval(feed_dict))

You should see that the last three tensors are equal, while prob1 is only a part of the cross-entropy, so it contains the correct value only where p is 1:

[[ 0.          0.          0.          0.59813893  0.        ]
 [ 0.55435514  0.          0.          0.          0.        ]]
[[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]
 [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]
[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
[[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
 [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]

Now it should be clear that taking the sum of -p * tf.log(q) along axis=1 doesn't make sense in this setting, although it would be a valid formula in the multi-class case.
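
If you do need one loss value per example (or a single training loss) in this multi-label setting, a common choice is to reduce the full sigmoid cross-entropy instead. A minimal sketch, reusing p, logit_q and feed_dict from the snippet above (and assuming a default session, as in the other snippets); whether you take the sum or the mean is a modeling choice, not something the API prescribes:

xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
per_example_loss = tf.reduce_sum(xent, axis=1)  # one value per row of the batch
batch_loss = tf.reduce_mean(xent)               # a single scalar, often used as the training loss
print(per_example_loss.eval(feed_dict))
print(batch_loss.eval(feed_dict))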

Maxim
  • logit_q can be anything from -infinity to +infinity. I guess the way you make the logit look like a probability is a bit misleading? – LKS Feb 19 '18 at 23:02
  • Logit is log-probability; it's never stated that it is like a probability. – Maxim Feb 19 '18 at 23:07
  • I am not trying to say there is a mistake. Of course you never state that it is a probability. A logit can be any number, but just the choice of picking them to be [0.2, 0.8] makes it look misleading. Btw, I think a logit is usually interpreted as log-odds, where odds = `p/(1-p)` and `p` is interpreted as a probability? – LKS Feb 20 '18 at 15:03
  • I see what you mean. This choice simply matches `logit_q` from the question. But you're right, it can be anything. And you're also right, calling it "log-odds" would be more precise, but people also say "log-probability" meaning the same thing. – Maxim Feb 20 '18 at 15:11
  • I guess the person who posted the question may be confused in a few places. Thanks for your answer. It also clears my doubt about `tf.nn.sigmoid_cross_entropy_with_logits`. – LKS Feb 20 '18 at 15:16
  • During training for the multi-label case, is it correct to use as the loss to minimize the sum along axis=1 of the tf.nn.sigmoid_cross_entropy_with_logits vector? – guik Feb 22 '18 at 09:08
  • Yes, actually it's common to reduce it to the *mean* along all axes. – Maxim Feb 22 '18 at 10:50
  • @Maxim can you take a crack at this? https://stackoverflow.com/questions/53612973/tensorflow-sigmoid-cross-entropy-with-logits-for-1d-data – SumNeuron Dec 04 '18 at 12:26

You can understand the difference between softmax and sigmoid cross-entropy in the following way:

  1. softmax cross-entropy actually deals with a single probability distribution
  2. sigmoid cross-entropy actually deals with multiple independent binary probability distributions, and each binary distribution can be treated as a two-class probability distribution

So in either case the cross-entropy is:

   p * -tf.log(q)

For softmax cross-entropy it looks exactly like the formula above.

For sigmoid cross-entropy it looks a little different, because there are multiple binary probability distributions; for each binary distribution it is

p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

p and (1 - p) can be treated as the two class probabilities within each binary distribution.
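
Here is a small self-contained sketch of both formulas (the labels and logits are made up for illustration):

import tensorflow as tf

labels = tf.constant([[0., 0., 1.]])
logits = tf.constant([[1., 2., 3.]])

# softmax: one distribution over the 3 classes
softmax_ce = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
manual_softmax_ce = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits)), axis=1)

# sigmoid: 3 independent binary distributions
q = tf.nn.sigmoid(logits)
sigmoid_ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
manual_sigmoid_ce = labels * -tf.log(q) + (1 - labels) * -tf.log(1 - q)

with tf.Session() as sess:
    print(sess.run([softmax_ce, manual_softmax_ce]))   # the two match
    print(sess.run([sigmoid_ce, manual_sigmoid_ce]))   # the two match elementwise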

zhao yufei