I have a multi-label classification problem in which each target is a vector of ones and zeros that are not mutually exclusive (for the sake of clarity, my target is something like `[0, 1, 0, 0, 1, 1, ...]`).
My understanding so far is:
I should use a binary cross-entropy function (as explained in this answer).
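For reference (my own notation, not taken from the linked answer), with $y_i \in \{0, 1\}$ the target for label $i$ and $p_i$ the predicted probability, the per-sample loss averaged over the $N$ labels is:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\bigr]$$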
Also, I understood that `tf.keras.losses.BinaryCrossentropy()` is a wrapper around TensorFlow's `sigmoid_cross_entropy_with_logits`. It can be used with either `from_logits=True` or `from_logits=False` (as explained in this question).
Since `sigmoid_cross_entropy_with_logits` performs the sigmoid itself, it expects its input to be in the [-inf, +inf] range.

When the network already applies a sigmoid activation in its last layer, `tf.keras.losses.BinaryCrossentropy()` must be used with `from_logits=False`. It will then invert the sigmoid function and pass the result to `sigmoid_cross_entropy_with_logits`, which applies the sigmoid again. This, however, can cause numerical issues due to the asymptotes of the sigmoid/logit functions.

To improve numerical stability, we can drop the last sigmoid layer and use `tf.keras.losses.BinaryCrossentropy(from_logits=True)`.
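Here is a minimal sketch of the two setups as I understand them (the layer sizes, 20 inputs and 6 labels, are placeholders I made up):

```python
import tensorflow as tf

# Option A: sigmoid in the last layer, the loss receives probabilities.
model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(6, activation="sigmoid"),  # outputs in (0, 1)
])
model_a.compile(optimizer="adam",
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=False))

# Option B: no final activation; the loss applies the sigmoid internally,
# which is numerically more stable.
model_b = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(6),  # raw logits in (-inf, +inf)
])
model_b.compile(optimizer="adam",
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```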
Question:

If we use `tf.keras.losses.BinaryCrossentropy(from_logits=True)`, what target should I use? Do I need to change my zeros-and-ones target vector?
I suppose that I should then apply a sigmoid activation to the network output at inference time. Is there a way to add a sigmoid layer that is active only in inference mode and not in training mode? (See the sketch below for the kind of behavior I mean.)
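To make the question concrete, something like this wrapper is the behavior I'm after. `SigmoidOnInference` is just a name I made up, not an existing Keras layer, and `model_b` is the logits model from the sketch above:

```python
class SigmoidOnInference(tf.keras.Model):
    """Wraps a logits-producing model and applies the sigmoid
    only when not training. (My own made-up wrapper, not a Keras API.)"""

    def __init__(self, base_model, **kwargs):
        super().__init__(**kwargs)
        self.base_model = base_model

    def call(self, inputs, training=False):
        logits = self.base_model(inputs, training=training)
        if training:
            return logits              # raw logits, matching from_logits=True
        return tf.nn.sigmoid(logits)   # probabilities at inference time

wrapped = SigmoidOnInference(model_b)
```

If I understand correctly, compiling `wrapped` with `BinaryCrossentropy(from_logits=True)` would make the loss see raw logits during `fit()`, while `wrapped.predict()` would return probabilities.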