
I am porting a keras model over to torch and I'm having trouble replicating the exact behavior of keras/tensorflow's 'categorical_crossentropy' after a softmax layer. I have some workarounds for this problem, so I'm only interested in understanding what exactly tensorflow computes when it calculates categorical cross entropy.

As a toy problem, I set up labels and predicted vectors

>>> import tensorflow as tf
>>> from tensorflow.keras import backend as K
>>> import numpy as np


>>> true = np.array([[0.0, 1.0], [1.0, 0.0]])
>>> pred = np.array([[0.0, 1.0], [0.0, 1.0]])

And calculate the Categorical Cross Entropy with:

>>> loss = tf.keras.losses.CategoricalCrossentropy()
>>> print(loss(pred, true).eval(session=K.get_session()))
8.05904769897461

This differs from the analytical result

>>> loss_analytical = -1*K.sum(true*K.log(pred))/pred.shape[0]
>>> print(loss_analytical.eval(session=K.get_session()))
nan
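
The nan itself presumably comes from the log(0) terms: log(0) is -inf, and multiplying it by a zero label gives nan in NumPy, e.g.:

>>> np.log(0.0)
-inf
>>> 0.0 * np.log(0.0)
nan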

I dug into the source code for keras/tf's cross entropy (see Softmax Cross Entropy implementation in Tensorflow Github Source Code) and found the C++ function at https://github.com/tensorflow/tensorflow/blob/c903b4607821a03c36c17b0befa2535c7dd0e066/tensorflow/compiler/tf2xla/kernels/softmax_op.cc, line 116. In that function, there is a comment:

// sum(-labels *
// ((logits - max_logits) - log(sum(exp(logits - max_logits)))))
// along classes
// (The subtraction broadcasts along the batch dimension.)

And implementing that, I tried:

>>> max_logits = K.max(pred, axis=0)
>>> xent = K.sum(-true * ((pred - max_logits) - K.log(K.sum(K.exp(pred - max_logits)))))/pred.shape[0]

>>> print(xent.eval(session=K.get_session()))
1.3862943611198906

I also tried to print the trace for xent.eval(session=K.get_session()), but the trace is ~95000 lines long. So the question is: what exactly is keras/tf doing when calculating 'categorical_crossentropy'? It makes sense that it doesn't return nan, since that would cause training issues, but where does the 8.05 come from?

ahagen

2 Answers


Here are some things that I noticed in your code.

First, your predictions show two data instances, [0.0, 1.0] and [0.0, 1.0].

pred = np.array([[0.0, 1.0], [0.0, 1.0]])

They should indicate probabilities, but the values after softmax typically are not exactly 0.0 and 1.0. Try 0.01 and 0.99 instead.

Second, the arguments to the CategoricalCrossentropy() call should be true, pred, not pred, true.

So this is what I get:

import tensorflow as tf
from tensorflow.keras import backend as K
import numpy as np

true = np.array([[0.0, 1.0], [1.0, 0.0]])
pred = np.array([[0.01, 0.99], [0.01, 0.99]])

loss = tf.keras.losses.CategoricalCrossentropy()
print(loss(true, pred).numpy())
# 2.307610273361206

For completeness, let's try what you did, using pred, true:

print(loss(pred, true).numpy())
# 8.05904769897461

That's where your mysterious 8.05 came from.

Is my answer 2.307610273361206 correct? Let's compute the loss by hand. Following the explanation in this StackOverflow post, we can compute the loss of each of the two data instances and then compute their average.

loss1 = -(0.0 * np.log(0.01) + 1.0 * np.log(0.99))
print(loss1) # 0.01005033585350145

loss2 = -(1.0 * np.log(0.01) + 0.0 * np.log(0.99))
print(loss2) # 4.605170185988091

# Total loss is the average of the per-instance losses.
loss = (loss1 + loss2) / 2
print(loss) # 2.307610260920796

So it looks like CategoricalCrossentropy() is producing the right answer.

stackoverflowuser2010
    Thanks for this. I agree that most softmax outputs are not _exactly_ zero, but the training in my torch ported code often ends up with a loss of `nan`, which I believe is occurring because one of the predictions is zero. It looks like there is a clipping applied in keras (by epsilon, as in @xdurch0's answer). – ahagen Dec 03 '20 at 21:21

The problem is that you are using hard 0s and 1s in your predictions. This leads to nan in your calculation since log(0) is undefined (or infinite).

What is not really documented is that the Keras cross-entropy automatically "safeguards" against this by clipping the values to be inside the range [eps, 1-eps]. This means that, in your example, Keras gives you a different result because it flat out replaces the predictions by other values.

If you replace your predictions by soft values, you should be able to reproduce the results. This makes sense anyway, since your networks will usually return such values via a softmax activation; hard 0/1 only happens in the case of numerical underflow.

If you want to check this for yourself, the clipping happens here. This function is eventually called by the CategoricalCrossentropy function. epsilon is defined elsewhere, but it seems to be 0.0000001 -- try your manual calculation with pred = np.clip(pred, 0.0000001, 1-0.0000001) and you should see the result 8.059047875479163.
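
For example, here is a quick NumPy check of that (a minimal sketch using the arrays from the question and epsilon = 1e-7):

import numpy as np

true = np.array([[0.0, 1.0], [1.0, 0.0]])
pred = np.array([[0.0, 1.0], [0.0, 1.0]])

# Clip the predictions the same way Keras does before taking the log.
eps = 0.0000001
pred_clipped = np.clip(pred, eps, 1 - eps)

# Plain categorical cross entropy, averaged over the batch.
loss_manual = -np.sum(true * np.log(pred_clipped)) / pred.shape[0]
print(loss_manual)  # ~8.059047875479163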

xdurch0
    Great! I knew the discrepancy was in how Keras was nan-guarding, and you found it. I was able to reproduce it with Keras and with torch.clamp. Interestingly, this uses the analytical calculation (sum(-true * log(pred))/batch_size), not the log(sum(exp(...))) formulation described in TensorFlow's `softmax_op.cc`. – ahagen Dec 03 '20 at 21:20