
After using TensorFlow for quite a while I have read some Keras tutorials and implemented some examples. I have found several tutorials for convolutional autoencoders that use keras.losses.binary_crossentropy as the loss function.

I thought binary_crossentropy was not supposed to be a multi-class loss function and would most likely expect binary labels, but in fact Keras (with the TF Python backend) calls tf.nn.sigmoid_cross_entropy_with_logits, which is actually intended for classification tasks with multiple, independent classes that are not mutually exclusive.
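
For reference, this is how I understand that relationship (a minimal sketch using the current tf.keras API; the numbers are arbitrary):

import tensorflow as tf

logits = tf.constant([[1.2, -0.8], [0.3, 2.1]])
targets = tf.constant([[1.0, 0.0], [0.7, 0.3]])  # not restricted to 0/1

# Keras averages the element-wise sigmoid cross-entropy over the last axis ...
keras_loss = tf.keras.losses.binary_crossentropy(targets, logits, from_logits=True)

# ... which matches calling the TF op directly:
tf_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits), axis=-1)
# keras_loss and tf_loss contain the same per-sample values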

On the other hand, I expected categorical_crossentropy to be intended for multi-class classification, where the target classes depend on each other but are not necessarily one-hot encoded.

However, the Keras documentation states:

(...) when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).

If I am not mistaken, this is just the special case of one-hot encoded classification tasks, but the underlying cross-entropy loss should also work with arbitrary probability distributions as labels ("multi-class", dependent labels)?
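
At least mathematically, H(p, q) = -sum_i p_i * log(q_i) is defined for any target distribution p, not just one-hot vectors, and a quick check with the TF backend (a sketch, arbitrary numbers) suggests Keras accepts such targets:

import tensorflow as tf

p = tf.constant([[0.7, 0.2, 0.1]])  # soft target distribution, not one-hot
q = tf.constant([[0.5, 0.3, 0.2]])  # predicted distribution

keras_loss = tf.keras.losses.categorical_crossentropy(p, q)
manual = -tf.reduce_sum(p * tf.math.log(q), axis=-1)
# both evaluate to roughly 0.887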

Additionally, Keras uses tf.nn.softmax_cross_entropy_with_logits (TF Python backend) for its implementation, which itself states:

NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

Please correct me if I am wrong, but it looks to me as if the Keras documentation is, at the very least, not very "detailed"?!

So, what is the idea behind Keras' naming of the loss functions? Is the documentation correct? If binary cross-entropy really relied on binary labels, it should not work for autoencoders, right?! Likewise, categorical cross-entropy should only work with one-hot encoded labels if the documentation is correct?!

daniel451

3 Answers


You are right about the areas where each of these losses is applicable:

  • binary_crossentropy (and tf.nn.sigmoid_cross_entropy_with_logits under the hood) is for binary multi-label classification (labels are independent).
  • categorical_crossentropy (and tf.nn.softmax_cross_entropy_with_logits under the hood) is for multi-class classification (classes are exclusive).

See also the detailed analysis in this question.
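
For illustration, the two typical setups look like this (a minimal sketch; the input size, layer widths and number of classes are arbitrary placeholders):

import tensorflow as tf

# Multi-label: 5 independent yes/no decisions -> sigmoid + binary_crossentropy
multi_label = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(5, activation='sigmoid'),
])
multi_label.compile(optimizer='adam', loss='binary_crossentropy')

# Multi-class: one distribution over 5 exclusive classes -> softmax + categorical_crossentropy
multi_class = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(5, activation='softmax'),
])
multi_class.compile(optimizer='adam', loss='categorical_crossentropy')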

I'm not sure which tutorials you mean, so I can't comment on whether binary_crossentropy is a good or bad choice for autoencoders.

As for the naming, it is absolutely correct and reasonable. Or do you think sigmoid and softmax names sound better?

So the only confusion left in your question concerns the categorical_crossentropy documentation. Note that everything it states is correct: the loss supports the one-hot representation. With the TensorFlow backend, this function indeed also works with any probability distribution for the labels (in addition to one-hot vectors); that could be mentioned in the docs, but it doesn't look critical to me. Moreover, one would need to check whether soft classes are supported in the other backends, Theano and CNTK. Remember that Keras tries to be minimalistic and targets the most popular use cases, so I can understand the logic here.
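
For example, with the TensorFlow backend the following quick check (a sketch, arbitrary numbers) shows that soft labels are accepted and agree with the underlying TF op:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
soft_labels = tf.constant([[0.6, 0.3, 0.1]])  # a valid distribution, but not one-hot

keras_loss = tf.keras.losses.categorical_crossentropy(soft_labels, logits, from_logits=True)
tf_loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)
# both return the same finite value, so soft targets work with the TF backend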

Maxim

Not sure if this answers your question, but for the softmax loss the output layer needs to be a probability distribution (i.e. sum to 1), while for the binary cross-entropy loss it doesn't. Simple as that. (Binary doesn't mean that there are only 2 output classes, it just means that each output is binary.)
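
A quick illustration of that constraint (the logits are just made-up numbers):

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])

tf.math.sigmoid(logits)  # ~[0.88, 0.27, 0.62]: independent outputs, need not sum to 1
tf.nn.softmax(logits)    # ~[0.79, 0.04, 0.18]: one distribution, sums to 1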

maxymoo
  • Yes (sorry for the confusion): I actually meant that for *n* output neurons, each of these should be either `0` or `1`, according to the naming & Keras documentation (of `binary_crossentropy`). However (again, if I am not mistaken), this is wrong: Keras (TF Python backend) uses `tf.nn.sigmoid_cross_entropy_with_logits`, which is intended for multi-class, independent, not mutually exclusive classification problems. That means for *n* output neurons, each of these can have a value (most likely float32) in the interval [0.0, 1.0] (sigmoid activation). – daniel451 Dec 18 '17 at 22:23
  • The outputs of the network will be float-valued when you're using your network for scoring, but you need to use binary labels when you're training; you can think of the final layer as multiple logistic regression models on the output of the second-to-last layer if it helps – maxymoo Dec 18 '17 at 22:41
  • This is what someone would expect from `binary_crossentropy`, right? But again, if this were really the case, then (1) an autoencoder should not work with `binary_crossentropy` and (2) the usage of `tf.nn.sigmoid_cross_entropy_with_logits` would be wrong, since it is for independent, multi-class problems with labels that are not mutually exclusive. – daniel451 Dec 18 '17 at 22:46
  • And it should also be wrong for `categorical_crossentropy`, since it uses `tf.nn.softmax_cross_entropy_with_logits`, and cross-entropy itself in that case, like the TF implementation, does not rely on the special case of one-hot encoded labels (i.e. *all zeros, except for the true class, which is 1*). It also works (mathematically, and as stated by the TF documentation) if you supply any probability distribution as labels. – daniel451 Dec 18 '17 at 22:48
  • This is why I am confused with Keras' naming of the loss functions and its documentation. Either they have some additional implementation or I am missing something. Otherwise their naming & documentation is not very detailed and partly wrong?! – daniel451 Dec 18 '17 at 22:49
  • OK, so for a CNN autoencoder, the interpretation of the outputs is different, the output is pixel intensity, not probability. Not a problem, just means you have a different interpretation for the loss function. – maxymoo Dec 19 '17 at 01:14
  • **Binary doesn't mean that there are only 2 output classes, it just means that each output is binary.** Didn't know that. thanks – ozgur Jan 23 '19 at 07:29

The documentation doesn't mention that BinaryCrossentropy can be used for multi-label classification, and that can be confusing. But it can also be used for a binary classifier (when we have only 2 mutually exclusive classes, like cats and dogs); see the classic example. In this case we have to use a single output unit (n_classes = 1):

tf.keras.layers.Dense(units=1)
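
For instance, a minimal sketch of such a binary (cats vs. dogs) classifier head (the input size and hidden width are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(units=1, activation='sigmoid'),  # single output: P(class == 1)
])
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())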

Also, BinaryCrossentropy and tf.keras.losses.binary_crossentropy behave differently: the class reduces the loss over the batch to a single scalar, while the function returns one value per sample (see below).

Let's look at the example from the documentation to prove that it is actually for multi-label classification.

import numpy as np
import tensorflow as tf

y_true = tf.convert_to_tensor([[0, 1], [0, 0]])
y_pred = tf.convert_to_tensor([[0.6, 0.4], [0.4, 0.6]])

# The BinaryCrossentropy class reduces over outputs and samples to a single scalar.
bce = tf.keras.losses.BinaryCrossentropy()
loss1 = bce(y_true=y_true, y_pred=y_pred)
# <tf.Tensor: shape=(), dtype=float32, numpy=0.81492424>

# The binary_crossentropy function returns one loss value per sample.
loss2 = tf.keras.losses.binary_crossentropy(y_true, y_pred)
# <tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.9162905 , 0.71355796], dtype=float32)>

np.mean(loss2.numpy())
# 0.81492424

# For comparison: treating the two columns as mutually exclusive classes instead.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
y_true = tf.convert_to_tensor([0, 0])
scce(y_true, y_pred)
# <tf.Tensor: shape=(), dtype=float32, numpy=0.71355814>
y_true = tf.convert_to_tensor([1, 0])
scce(y_true, y_pred)
# <tf.Tensor: shape=(), dtype=float32, numpy=0.9162907>
irudyak