
I want to use the TensorFlow built-in cross-entropy function. However, in the documentation I'm reading:

Do not call this op with the output of softmax, as it will produce incorrect results.

https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits

As is often done, I am using a softmax activation in my last (output) layer:

result = tf.layers.dense(inputs=dropout, units=classes_num, activation=tf.nn.softmax)

Is it therefore incorrect to use this function with a softmax output, or is the documentation wrong? I don't understand this and would be grateful for a short explanation. (Which TensorFlow cost function would then be the correct one to use with a softmax output layer?)

– sandboxj

2 Answers


Since tf.nn.softmax_cross_entropy_with_logits internally computes the softmax of its input (in a numerically stable way), you have to define your network so that its last layer uses a linear activation, e.g. tf.identity:

result = tf.layers.dense(inputs=dropout, units=classes_num, activation=tf.identity)
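
For completeness, here is a minimal sketch of how the raw logits then feed into the loss. It continues the code above (result and classes_num come from your snippet); the labels placeholder is an assumption, not part of the original code:

# Hypothetical one-hot labels placeholder with classes_num columns.
labels = tf.placeholder(tf.float32, shape=[None, classes_num])

# The op applies the softmax internally, so pass the raw logits.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=result))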

Moreover, once the network has been trained and you want to use the model for inference, you have to replace the activation with the softmax.

Thus, introduce an is_training Python boolean variable in your code and use it to switch the model definition between training and testing:

result = tf.layers.dense(inputs=dropout, units=classes_num,
             activation=tf.identity if is_training else tf.nn.softmax)
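
Alternatively, you can keep the layer linear in both cases and apply tf.nn.softmax(result) on top of the logits only at inference time; then the model definition itself does not change between training and testing.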
– nessuno
    I see. In your implementation, would it then still be correct to use `tf.losses.softmax_cross_entropy` (which is essentially `tf.nn.softmax_cross_entropy_with_logits`) as the cost function during training? – sandboxj Nov 05 '17 at 11:37

The function you have mentioned is tf.nn.softmax_cross_entropy_with_logits. As the name suggests, it first applies a softmax to the logits (i.e. normalizes them into a probability distribution) and then computes the cross-entropy between those probabilities and the labels.

Therefore, if you feed it values that have already gone through a softmax (as result in your code does), the softmax gets applied twice, which produces incorrect results.
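
A tiny numerical sketch (illustrative values of my own choosing, using TF 1.x session-style code) makes this concrete: applying softmax twice flattens the probabilities, so the loss would be computed on the wrong distribution.

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])

once = tf.nn.softmax(logits)   # correct probabilities
twice = tf.nn.softmax(once)    # what the op effectively computes if fed
                               # softmax output instead of raw logits

with tf.Session() as sess:
    print(sess.run(once))   # ~[[0.659 0.242 0.099]]
    print(sess.run(twice))  # ~[[0.448 0.296 0.256]] -- flattened, wrong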

Hope this helps.

– Nipun Wijerathne