When I read the guides on the TensorFlow website, I found two ways to customize losses. The first one is to define a loss function, like this:

import tensorflow as tf

def basic_loss_function(y_true, y_pred):
    # Mean absolute error: reduced over every element to a single scalar
    return tf.math.reduce_mean(tf.abs(y_true - y_pred))

And for the sake of simplicity, we assume the batch size is 1, so y_true and y_pred both have shape (1, c), where c is the number of classes. So in this method, we pass in two vectors, y_true and y_pred, and get back a single value (a scalar).
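A quick check with toy values (the numbers and c = 3 here are just my own picks) confirms the scalar output:

y_true = tf.constant([[1.0, 0.0, 1.0]])   # shape (1, 3)
y_pred = tf.constant([[0.8, 0.1, 0.6]])   # shape (1, 3)

print(basic_loss_function(y_true, y_pred).shape)  # () -- a scalar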

Then, the second method is to subclass the tf.keras.losses.Loss class; the code in the guide is:

import tensorflow as tf
from tensorflow import keras

class WeightedBinaryCrossEntropy(keras.losses.Loss):
    """
    Args:
      pos_weight: Scalar to affect the positive labels of the loss function.
      weight: Scalar to affect the entirety of the loss function.
      from_logits: Whether to compute loss from logits or the probability.
      reduction: Type of tf.keras.losses.Reduction to apply to loss.
      name: Name of the loss function.
    """
    def __init__(self, pos_weight, weight, from_logits=False,
                 reduction=keras.losses.Reduction.AUTO,
                 name='weighted_binary_crossentropy'):
        super().__init__(reduction=reduction, name=name)
        self.pos_weight = pos_weight
        self.weight = weight
        self.from_logits = from_logits

    def call(self, y_true, y_pred):
        # binary_crossentropy reduces the class axis, giving shape (batch,);
        # [:, None] restores a trailing axis so it broadcasts against y_true
        ce = tf.losses.binary_crossentropy(
            y_true, y_pred, from_logits=self.from_logits)[:, None]
        # weight the positive and negative terms per class -> shape (batch, c)
        ce = self.weight * (ce * (1 - y_true) + self.pos_weight * ce * y_true)
        return ce

In the call method, as usual, we pass in the two vectors y_true and y_pred, but I notice that it returns ce, which is a VECTOR with shape (1, c)!
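For example, calling call directly with toy values (my own numbers, c = 3) gives a (1, 3)-shaped tensor rather than a scalar:

wbce = WeightedBinaryCrossEntropy(pos_weight=2.0, weight=1.0)

y_true = tf.constant([[1.0, 0.0, 1.0]])
y_pred = tf.constant([[0.8, 0.1, 0.6]])

print(wbce.call(y_true, y_pred).shape)  # (1, 3) -- a vector, not a scalar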

So is there any problem in the above toy example? Or does TensorFlow 2.x have some magic behind the scenes?

digger

1 Answer

The main difference between the two, aside from implementation, is the type of the loss function. The first one is an L1 loss (the average of absolute differences by definition, used mostly for regression-like problems), while the second is binary crossentropy (used for classification). They are not meant to be different implementations of the same loss, and this is stated in the guide you linked.

Binary crossentropy in a multi-label, multi-class classification setting outputs a value for every class, as if they were independent of each other.
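To make that concrete, here is a minimal sketch of binary crossentropy computed elementwise, one term per class (the toy values and the clipping epsilon are my own choices, the latter for numerical safety):

y_true = tf.constant([[1.0, 0.0, 1.0]])
y_pred = tf.constant([[0.8, 0.1, 0.6]])

p = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
per_class = -(y_true * tf.math.log(p) + (1 - y_true) * tf.math.log(1.0 - p))
print(per_class.shape)  # (1, 3): one independent loss term per class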

Edit:

In the second loss function, the reduction parameter controls the way the output of call is aggregated, e.g. taking the sum of the elements or summing over the batch. By default, your code uses keras.losses.Reduction.AUTO, which translates into SUM_OVER_BATCH_SIZE if you check the source code: the per-class vector returned by call is summed and averaged over its elements, so the final loss is a single scalar. There are other reductions available; you can check them in the docs. I believe that even if you set the reduction so that the loss elements are not summed (Reduction.NONE), TF optimizers will do the sum themselves, to avoid errors from backpropagating a vector. Backpropagation on a vector would cause problems at weights that "contribute" to every loss element. However, I have not checked this in the source code. :)
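A quick way to see the effect of the reduction argument (my own toy values again, reusing the class from the question):

y_true = tf.constant([[1.0, 0.0, 1.0]])
y_pred = tf.constant([[0.8, 0.1, 0.6]])

auto_loss = WeightedBinaryCrossEntropy(pos_weight=2.0, weight=1.0)
none_loss = WeightedBinaryCrossEntropy(pos_weight=2.0, weight=1.0,
                                       reduction=keras.losses.Reduction.NONE)

print(auto_loss(y_true, y_pred).shape)  # (): reduced to a single scalar
print(none_loss(y_true, y_pred).shape)  # (1, 3): per-class values kept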

Andrea Angeli
  • Yes, they are two different losses. But I wonder whether the call method should just return a scalar rather than a vector when the batch size equals 1? – digger May 16 '20 at 07:17
  • They are two different losses for two different things; that is why the output shape is different. One returns a scalar by definition (L1), while the other returns a scalar for every class by definition in a multiclass setting (c = number of classes). Even if you did not subclass tf.keras.Loss, binary crossentropy should return a vector of shape (1,c). – Andrea Angeli May 16 '20 at 07:25
  • Alternatively you could use categorical crossentropy, which would return a scalar. – Andrea Angeli May 16 '20 at 07:31
  • Thanks a lot for your detailed explanation! But in my view, the loss of a neural network should be a single scalar rather than a vector because of BP. So I wonder whether it is necessary to do sum(ce) to get the total loss over all classes for the forward pass? – digger May 17 '20 at 06:41
  • That is correct, backpropagating a vector would probably cause issues, because the optimizers wouldn't be able to decide which component of the gradient vectors to use. I updated my answer regarding loss reduction in subclassed losses. Funny, I actually deleted this part from my original answer. Also, I incorrectly used "multiclass classification" when I meant "multilabel". – Andrea Angeli May 17 '20 at 08:15
  • Thanks again for your detailed explanation. Indeed, I have tried to trace the flow of the program, but I lost myself in the endless wrapping of these classes and methods. Maybe the logic hides in some low-level code. – digger May 17 '20 at 14:11