5

I have a multi-label classification problem in which each target is a vector of ones and zeros that are not mutually exclusive (for clarity, my target is something like [0, 1, 0, 0, 1, 1, ...]).

My understanding so far is:

  • I should use a binary cross-entropy function. (as explained in this answer)

  • Also, I understand that tf.keras.losses.BinaryCrossentropy() is a wrapper around TensorFlow's sigmoid_cross_entropy_with_logits, and that it can be used with from_logits either True or False. (as explained in this question)

  • Since sigmoid_cross_entropy_with_logits applies the sigmoid itself, it expects its input to be in the (-inf, +inf) range, i.e. raw logits.

  • When the network itself applies a sigmoid activation on the last layer, tf.keras.losses.BinaryCrossentropy() must be used with from_logits=False. It will then invert the sigmoid and pass the result to sigmoid_cross_entropy_with_logits, which applies the sigmoid again. However, this can cause numerical issues due to the asymptotes of the sigmoid/logit function.

  • To improve numerical stability, we can drop the final sigmoid layer and use tf.keras.losses.BinaryCrossentropy(from_logits=True), e.g. as sketched below.
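To make this concrete, here is a minimal sketch of the setup I have in mind (the layer sizes, input shape and metric are placeholders, not my actual network):

```python
import tensorflow as tf

# Multi-label setup: each sample has NUM_CLASSES independent 0/1 targets.
NUM_CLASSES = 6

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    # No sigmoid here: this layer outputs raw logits in (-inf, +inf).
    tf.keras.layers.Dense(NUM_CLASSES)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    # Threshold on logits: 0.0 corresponds to probability 0.5.
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)]
)
```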

Question:

If we use tf.keras.losses.BinaryCrossentropy(from_logits=True), what targets should I use? Do I need to change my multi-hot target vectors in any way?

I suppose I should then apply a sigmoid activation to the network output at inference time. Is there a way to add a sigmoid layer that is active only in inference mode and not in training mode?
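What I have in mind is something like the following (just a sketch of the idea; make_inference_model and the placeholder names are mine, I am not sure this is the right approach):

```python
import tensorflow as tf

def make_inference_model(trained_model):
    """Wrap a logits-producing model with a sigmoid for inference only.

    The wrapper is a separate model, so training can keep using the
    raw-logits model together with BinaryCrossentropy(from_logits=True).
    """
    return tf.keras.Sequential([
        trained_model,
        tf.keras.layers.Activation('sigmoid')
    ])

# Usage (placeholder names):
# inference_model = make_inference_model(model)
# probabilities = inference_model.predict(x_new)
```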

Luca
  • "This however can cause numerical issues due to the asymptotes of the sigmoid/logit function." Can you provide a source for this, please? And further, I don't think just using a sigmoid layer and the simple `model.compile(loss='binary_crossentropy', ...)` would bring any issues. Lots of models have been trained this way without any problems! – today Apr 15 '20 at 18:57
  • https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy "Note: Using from_logits=True may be more numerically stable." – Luca Apr 15 '20 at 19:10
  • That's not a common problem in this specific case. Let me put it this way: have you encountered this specific problem (i.e. convergence problems due to numerical instability of `from_logits=False` in cross-entropy loss function) in your experiments? If not, then you should not over-think it too much. – today Apr 15 '20 at 19:15
  • This is another source https://stackoverflow.com/questions/52125924/why-does-sigmoid-crossentropy-of-keras-tensorflow-have-low-precision – Luca Apr 15 '20 at 19:17
  • I agree with you that overthinking is usually bad, but if I can improve the learning of a model just removing the final activation layer and adding an optional parameter to the loss, why shouldn't I? – Luca Apr 15 '20 at 19:19
  • While googling right now I found this interesting analysis https://towardsdatascience.com/sigmoid-activation-and-binary-crossentropy-a-less-than-perfect-match-b801e130e31 It basically says "yes, it COULD give numerical instability, but in practice it never happens" – Luca Apr 15 '20 at 19:25
    I know that (the author of the answer to that question happens to be me!). But as you can see there, it does not produce any **serious issue** for **practical applications** since the difference is so tiny. Anyways, I am not an expert in that area and maybe I am the one who is underthinking! Maybe someone else has a more established opinion. – today Apr 15 '20 at 19:25
  • "(the author of the answer to that question happens to be me!)" ... ops :P – Luca Apr 15 '20 at 19:26
  • If the rest I said is correct and you want to turn this comment in a short answer I will accept it – Luca Apr 15 '20 at 19:28

2 Answers

20

First, let me give some notes about the numerical stability:

As mentioned in the comments section, the numerical instability when using from_logits=False comes from transforming the probability values back into logits, which involves a clipping operation (as discussed in this question and its answer). However, to the best of my knowledge, this does NOT create any serious issues for most practical applications (although there are some cases where applying the softmax/sigmoid function inside the loss function, i.e. using from_logits=True, is more numerically stable in terms of computing gradients; see this answer for a mathematical explanation).

In other words, unless you care about the precision of the generated probability values at a sensitivity below 1e-7, or you have observed a related convergence issue in your experiments, then you should not worry too much; just use the sigmoid and binary cross-entropy as before, i.e. model.compile(loss='binary_crossentropy', ...), and it will work fine.
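To see concretely what that ~1e-7 sensitivity means, here is a small illustrative snippet (my own example, not required for the answer) comparing the two paths on increasingly confident predictions:

```python
import tensorflow as tf

y_true = tf.constant([[1.0]])
bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)

for logit in [5.0, 10.0, 20.0]:
    logit_t = tf.constant([[logit]])
    prob_t = tf.sigmoid(logit_t)  # what a final sigmoid layer would output
    # from_logits=False clips the probability to [eps, 1 - eps] (eps ~ 1e-7)
    # before taking the log, so very confident predictions collapse to roughly
    # the same tiny loss value; from_logits=True keeps resolving them.
    print(logit,
          float(bce_probs(y_true, prob_t)),
          float(bce_logits(y_true, logit_t)))
```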

All in all, if you are really concerned with numerical stability, you can take the safest path and use from_logits=True without using any activation function on the last layer of the model.


Now, to answer the original question: the true labels or target values (i.e. y_true) should still be only zeros and ones when using BinaryCrossentropy(from_logits=True). Rather, it is y_pred (i.e. the output of the model) that should not be probabilities in this case (i.e. the sigmoid function should not be used on the last layer when from_logits=True); the model should output raw logits instead.
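As a quick sanity check (again, just an illustrative snippet), the multi-hot targets are identical in both cases; only the form of the model output changes between logits and probabilities:

```python
import tensorflow as tf

# Multi-hot targets: classes are NOT mutually exclusive.
y_true = tf.constant([[0., 1., 0., 0., 1., 1.]])

# Raw model output (logits) when there is no sigmoid on the last layer.
logits = tf.constant([[-1.2, 2.3, -0.7, 0.1, 1.5, 3.0]])

loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logits)
loss_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)(y_true, tf.sigmoid(logits))

# Both compute the same quantity (up to the clipping discussed above).
print(float(loss_from_logits), float(loss_from_probs))
```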

today
3

I tested a GAN that recovers a realistic image from a sketch, and the only difference between the two training cycles was BinaryCrossentropy(from_logits=True/False). The last network layer is a Conv2D with no activation, so the right choice should be from_logits=True, but for experimental purposes I ran both and found a huge difference in the generator and discriminator losses:

  • orange - True,
  • blue - False.

Here is the link to the Colab notebook. The exercise is based on the TensorFlow pix2pix tutorial.
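Roughly, the relevant part of the setup looks like this (paraphrased in the spirit of the pix2pix tutorial; the from_logits flag was the only thing switched between the two runs):

```python
import tensorflow as tf

# The only difference between the two training runs: from_logits=True vs. False.
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(disc_real_output, disc_generated_output):
    # Real images should be classified as 1, generated images as 0.
    real_loss = loss_object(tf.ones_like(disc_real_output), disc_real_output)
    generated_loss = loss_object(tf.zeros_like(disc_generated_output), disc_generated_output)
    return real_loss + generated_loss

def generator_gan_loss(disc_generated_output):
    # The generator tries to make the discriminator output 1 on generated images.
    return loss_object(tf.ones_like(disc_generated_output), disc_generated_output)
```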

[Figure: loss difference between the two runs]

According to the exercise description, with from_logits=True (a short numeric check follows this list):

  • The value log(2) = 0.69 is a good reference point for these losses, as it indicates a perplexity of 2: That the discriminator is on average equally uncertain about the two options.
  • For the disc_loss a value below 0.69 means the discriminator is doing better than random, on the combined set of real+generated images.
  • For the gen_gan_loss a value below 0.69 means the generator is doing better than random at fooling the discriminator.
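That log(2) ≈ 0.69 reference point is simply the binary cross-entropy of a maximally uncertain discriminator, i.e. one that outputs probability 0.5 (logit 0) everywhere:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# A maximally uncertain discriminator outputs logit 0 (probability 0.5) everywhere.
uncertain_logits = tf.zeros((4, 1))

loss_on_real = bce(tf.ones((4, 1)), uncertain_logits)
loss_on_fake = bce(tf.zeros((4, 1)), uncertain_logits)

# Both are -log(0.5) = log(2) ~ 0.693, which is why 0.69 is the
# "no better than chance" reference value for these losses.
print(float(loss_on_real), float(loss_on_fake))
```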

With from_logits=False, the losses are roughly twice as high for both the generator and the discriminator, and a similar interpretation no longer seems to apply.

Final images are also different:

  • With from_logits=False, the resulting image looks blurry and unrealistic.