27

I'm learning the Keras API in TensorFlow (2.3). In this guide on the TensorFlow website, I found an example of a custom loss function:

    def custom_mean_squared_error(y_true, y_pred):
        return tf.math.reduce_mean(tf.square(y_true - y_pred))

The reduce_mean call in this custom loss function returns a scalar.

Is it right to define a loss function like this? As far as I know, the first dimension of the shapes of y_true and y_pred is the batch size. I think the loss function should return loss values for every sample in the batch, so it should give an array of shape (batch_size,). But the above function gives a single value for the whole batch.
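In other words, I would have expected something like this instead (my own guess at a per-sample version, which reduces only over the last axis):

    import tensorflow as tf

    def custom_mean_squared_error_per_sample(y_true, y_pred):
        # Reduce over the feature axis only, keeping one loss value
        # per sample: shape (batch_size,) instead of a single scalar.
        return tf.math.reduce_mean(tf.square(y_true - y_pred), axis=-1)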

Is the above example wrong, then? Could anyone give me some help with this problem?


P.S. Why do I think the loss function should return an array rather than a single value?

I read the source code of the Model class. When you provide a loss function (please note it's a function, not a loss class) to the Model.compile() method, this loss function is used to construct a LossesContainer object, which is stored in Model.compiled_loss. The loss function passed to the constructor of the LossesContainer class is used once again to construct a LossFunctionWrapper object, which is stored in LossesContainer._losses.

According to the source code of the LossFunctionWrapper class, the overall loss value for a training batch is calculated by the LossFunctionWrapper.__call__() method (inherited from the Loss class), i.e. it returns a single loss value for the whole batch. But LossFunctionWrapper.__call__() first calls the LossFunctionWrapper.call() method to obtain an array of losses, one per sample in the training batch. These losses are finally averaged to get the single loss value for the whole batch. It's in the LossFunctionWrapper.call() method that the loss function provided to the Model.compile() method is called.
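To illustrate what I mean, here is a rough sketch of that call chain (my own simplification, not the actual TF source):

    import tensorflow as tf

    # Simplified stand-in for Keras' LossFunctionWrapper, just to show
    # where the user-provided loss function is called and where the
    # per-sample losses are reduced.
    class LossFunctionWrapperSketch:
        def __init__(self, fn):
            self.fn = fn  # the loss function passed to Model.compile()

        def call(self, y_true, y_pred):
            # Expected to return one loss value per sample.
            return self.fn(y_true, y_pred)

        def __call__(self, y_true, y_pred):
            # Reduces the per-sample losses to a single scalar for the batch.
            per_sample_losses = self.call(y_true, y_pred)
            return tf.math.reduce_mean(per_sample_losses)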

That's why I think the custom loss function should return an array of losses, instead of a single scalar value. Besides, if we write a custom Loss class for the Model.compile() method, the call() method of our custom Loss class should also return an array, rather than a single value.



– Gödel

7 Answers

10

Actually, as far as I know, the shape of the return value of the loss function is not important: it could be a scalar tensor, or a tensor with one or multiple values per sample. The important thing is how it is reduced to a scalar value, so that it can be used in the optimization process or shown to the user. For that, you can check the reduction types in the Reduction documentation.
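For example, the built-in loss classes let you choose the reduction explicitly (a quick illustration; the values are arbitrary):

    import tensorflow as tf

    y_true = tf.constant([[0., 1.], [1., 1.]])
    y_pred = tf.constant([[1., 1.], [1., 0.]])

    # Reduction.NONE keeps one loss value per sample: shape (2,).
    mse_none = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)
    print(mse_none(y_true, y_pred))

    # The default (SUM_OVER_BATCH_SIZE) reduces to a single scalar.
    mse_default = tf.keras.losses.MeanSquaredError()
    print(mse_default(y_true, y_pred))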

Further, here is what the compile method documentation says about the loss argument, partially addressing this point:

loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature loss = fn(y_true,y_pred), where y_true = ground truth values with shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]. y_pred = predicted values with shape = [batch_size, d0, .. dN]. It returns a weighted loss float tensor. If a custom Loss instance is used and reduction is set to NONE, return value has the shape [batch_size, d0, .. dN-1] ie. per-sample or per-timestep loss values; otherwise, it is a scalar. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.

In addition, it's worth noting that most of the built-in loss functions in TF/Keras reduce over the last dimension (i.e. axis=-1).
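For instance, a quick check of the shapes with the built-in mean_squared_error function:

    import tensorflow as tf

    y_true = tf.zeros((32, 10))  # (batch_size, d)
    y_pred = tf.ones((32, 10))

    # The function reduces over the last axis only, so the result
    # has one loss value per sample.
    loss = tf.keras.losses.mean_squared_error(y_true, y_pred)
    print(loss.shape)  # (32,)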


For those who doubt that a custom loss function which returns a scalar value would work: you can run the following snippet and you will see that the model trains and converges properly.

    import tensorflow as tf
    import numpy as np

    # A custom loss that returns a single scalar for the whole batch.
    def custom_loss(y_true, y_pred):
        return tf.reduce_sum(tf.square(y_true - y_pred))

    # A one-layer linear model: 3 inputs -> 3 outputs.
    inp = tf.keras.layers.Input(shape=(3,))
    out = tf.keras.layers.Dense(3)(inp)

    model = tf.keras.Model(inp, out)
    model.compile(loss=custom_loss,
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))

    # Synthetic data from a linear mapping: y = 10x + 2.5.
    x = np.random.rand(1000, 3)
    y = x * 10 + 2.5
    model.fit(x, y, epochs=20)
– today
  • Yes, you are right. The `Loss.__call__()` method calls the `compute_weighted_loss` function to reduce the losses for every sample to a scalar loss for the training batch. We can't change this behavior unless we define a subclass of `Loss` and rewrite the `__call__()` method. But when we provide our custom loss function, it should return an array of losses for `compute_weighted_loss` to calculate the average. – Gödel Aug 19 '20 at 07:18
  • As to the built-in loss functions, if `y_true` and `y_pred` have the shape `(batch_size, output_dimension)`, then those loss functions just return a tensor of the shape `(batch_size,)`, i.e., one loss per sample. If `y_true` and `y_pred` have more than two dimensions, the output may have time steps, just like in an RNN/LSTM layer. – Gödel Aug 19 '20 at 07:24
  • That's not correct. This has nothing to do with subclassing `Loss` or defining a custom loss function. You can try it yourself: implement a dummy model and define a custom loss function which returns a scalar value as the loss; you will see that the model trains and converges properly. – today Aug 19 '20 at 07:44
  • @Gödel I just added a minimal example of a model which uses a loss function with a scalar return value at the end of my answer. You can try it yourself to see that it trains and converges properly. – today Aug 19 '20 at 07:50
  • I know you can train the model even if your custom loss function returns a scalar. It just means that the code does not check the shape of the return value of the loss function. But logically, the loss value for a training batch should be an average of the losses of each sample in the batch. – Gödel Aug 19 '20 at 10:04
  • In addition, what if you want to calculate a weighted average of the per-sample losses as the loss value of a training batch? You can't provide the weights to your custom loss function. You can check that `sample_weight` is ultimately used in the `Loss.__call__()` method, not in your custom loss function. – Gödel Aug 19 '20 at 10:12
  • @Gödel I didn't say that using a loss function with a scalar return value would cover all the various cases (e.g. supporting sample weights). I just said that it's possible and a valid thing to do. Of course, as I mentioned earlier, usually the reasonable thing to do is a per-sample loss value. But there is nothing wrong (in terms of model training) with a loss function with a scalar return value. – today Aug 19 '20 at 10:26
  • Right. At present you can rely on this behavior of the `Loss` class (i.e., it doesn't check the shape of the return value of a custom loss function). But in the future, if the `Loss.__call__()` method does check it, this may cause problems. For now, though, let's just define our custom loss functions this way. – Gödel Aug 19 '20 at 10:37
8

I opened an issue on GitHub. It's confirmed that a custom loss function is required to return one loss value per sample. The example will need to be updated to reflect this.

– Gödel
  • I don't think the TF devs are right there. There is no explicit or logical requirement for the loss function to return a per-sample loss (although that's a very reasonable thing to do). As the documentation also confirms, the loss function can return a scalar value as well, and the model will be trained without any problems. – today Aug 19 '20 at 07:13
  • It's because the scalar is passed to the `compute_weighted_loss` function, so it doesn't cause a problem. But the method of calculating the loss value for the training batch is then wrong. – Gödel Aug 19 '20 at 07:28
6

I think the question posted by @Gödel is totally legitimate and correct: the custom loss function should return a loss value per sample. And the explanation provided by @today is also correct. In the end, it all depends on the kind of reduction used.

So if one uses the class API to create a loss function, the reduction parameter is automatically inherited in the custom class. Its default value, "sum_over_batch_size", is used (which simply averages all the loss values in a given batch). The other options are "sum", which computes a sum instead of an average, and "none", which returns an array of loss values.

It is also mentioned in the Keras documentation that these differences in reduction are irrelevant when one is using model.fit(), because reduction is then automatically handled by TF/Keras.

And lastly, it is also mentioned that when a custom loss function is created, it should return an array of losses (one per individual sample); their reduction is handled by the framework.
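Putting those points together, a minimal custom loss class might look like this (a sketch of my own; the class name is arbitrary):

    import tensorflow as tf

    # call() returns one loss value per sample; the inherited reduction
    # (default "sum_over_batch_size") then averages them over the batch.
    class PerSampleMSE(tf.keras.losses.Loss):
        def call(self, y_true, y_pred):
            return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)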


– Sanchit
3

tf.math.reduce_mean takes the average over the whole batch and returns it. That's why the result is a scalar.

– Abhishek Verma
  • I know it's a scalar. But I think the loss function should return an array of losses for every sample in the batch, not a scalar for the whole batch. – Gödel Aug 13 '20 at 09:26
  • That's what I wrote: it returns a scalar because a mean is being taken. And it should return a scalar, because for backpropagation you need a single value and not an array. – Abhishek Verma Aug 13 '20 at 18:31
  • But according to the [source code](https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/keras/losses.py), the loss function actually should return an array of losses, one for every sample in a batch. For example, the `mean_squared_error` function in the source code returns an array, not a scalar. The `call()` method of `LossFunctionWrapper` also returns a loss value for each sample. The `__call__()` method of a `Loss` object uses the `call()` method or a loss function to get loss values for every sample, then averages those losses to get the loss of the whole batch. – Gödel Aug 14 '20 at 01:13
  • `reduce_sum` is being used here. The initial comments show that. – Abhishek Verma Aug 14 '20 at 06:12
  • Well, what should a LOSS function return given `y_true` and `y_pred`? – Gödel Aug 14 '20 at 07:14
  • The mean loss is required. – Abhishek Verma Aug 14 '20 at 07:29
  • When you provide a loss function to the `Model.compile()` method, this loss function will be converted to a `Loss` object. The `Loss.__call__()` method uses the `Loss.call()` method to get an array of losses, one for each sample, and then gets the average loss for the batch. The problem is that the `Loss.call()` method uses the loss function, so I think the loss function you provide to the `Model.compile()` method should return an array of losses for each sample, not the mean loss. – Gödel Aug 14 '20 at 08:13
  • Look at the code, it is taking the mean. So, just look at the code and don't think. – Abhishek Verma Aug 14 '20 at 08:18
  • Take a look at the algorithm of backpropagation on a batch. There you will definitely know what the algorithm needs. – Abhishek Verma Aug 14 '20 at 08:20
  • It's because I read the source code that I thought the example of a custom loss function given on the TensorFlow website was wrong. – Gödel Aug 14 '20 at 08:33
3

The loss function given on the TensorFlow website is absolutely correct.

    def custom_mean_squared_error(y_true, y_pred):
        return tf.math.reduce_mean(tf.square(y_true - y_pred))

In machine learning, the loss we use is the sum of the losses of the individual training examples, so it should be a scalar value. (Since we are using a single network for all the examples, we need a single loss value to update the parameters.)
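As a small illustration of that point (a toy sketch of my own, not from the guide), the gradients for the parameter update are taken of a single scalar:

    import tensorflow as tf

    # Toy model and data, just to show that the gradient step is
    # computed from one scalar loss value for the whole batch.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
    x = tf.random.normal((8, 3))
    y = tf.random.normal((8, 1))

    with tf.GradientTape() as tape:
        loss = tf.math.reduce_mean(tf.square(y - model(x)))  # scalar
    grads = tape.gradient(loss, model.trainable_variables)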

Regarding making containers for losses: when using parallel computation, a container is a simple and feasible way to keep track of the indices of the computed losses, since we train on batches and not on the whole training set.

– Rahul Vishwakarma
  • And in this [post](https://towardsdatascience.com/how-to-create-a-custom-loss-function-keras-3a89156ec69b), the author also said that "Loss function should always return a vector of length batch_size, Because you have to return a loss for each datapoint". – Gödel Aug 18 '20 at 02:00
  • In the source code of the [losses](https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/keras/losses.py) module, the `MeanSquaredError` class uses the `mean_squared_error` function to construct a `LossFunctionWrapper` object. You can check that the `mean_squared_error` function returns `K.mean(math_ops.squared_difference(y_pred, y_true), axis=-1)`, which is an array, not a single value. – Gödel Aug 18 '20 at 02:10
  • I know that when training the model we need a single loss value for the whole batch. But according to the source code, our custom loss function is not responsible for producing that single loss value. It is the `LossFunctionWrapper.__call__()` method that calculates the mean loss over all individual training samples; it calls the `LossFunctionWrapper.call()` method to get the losses for the individual samples, and it's in `LossFunctionWrapper.call()` that our custom loss function is called. Did you read the source code I mentioned above? – Gödel Aug 18 '20 at 02:17
1

The TensorFlow documentation misses it, but this is clearly stated and clarified in the Keras documentation. It says:

Note that this is an important difference between loss functions like tf.keras.losses.mean_squared_error and default loss class instances like tf.keras.losses.MeanSquaredError: the function version does not perform reduction, but by default the class instance does.

And it also states:

By default, loss functions return one scalar loss value per input sample.
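A quick way to see that difference for yourself (the values are arbitrary):

    import tensorflow as tf

    y_true = tf.constant([[0., 2.], [2., 2.]])
    y_pred = tf.constant([[2., 2.], [2., 0.]])

    # Function version: no reduction, one loss value per sample -> shape (2,).
    print(tf.keras.losses.mean_squared_error(y_true, y_pred))

    # Class instance: reduces by default -> a single scalar.
    print(tf.keras.losses.MeanSquaredError()(y_true, y_pred))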

– FTM
0

The dimensionality can be increased because of multiple channels; however, each channel should only have a scalar value for the loss.

– goodcow