When writing a custom loss function, should I use tf.reduce_mean, and if so how? Does it ever matter?

Question

The sample code below shows that all the following give the same (correct) results when writing a custom loss function (calculating mean_squared_error) for a simple linear regression model.

Do not use tf_reduce_mean() (so returning a loss for each example)
Use tf_reduce_mean() (so returning a single loss)
Use tf_reduce_mean(..., axis-1)

Is there any reason to prefer one approach to another, and are there any circumstances where it makes a difference?

(There is, for example sample code at Make a custom loss function in keras that suggests axis=-1 should be used)

import numpy as np
import tensorflow as tf

# Create simple dataset to do linear regression on
# The mean squared error (~ best achievable MSE loss after fitting linear regression) for this dataset is 0.01
xtrain = np.random.randn(5000)                  # Already normalized
ytrain = xtrain + np.random.randn(5000) * 0.1   # Close enough to being normalized

# Function to create model and fit linear regression, and report final loss
def cre_and_fit(loss="mean_squared_error", lossdescription="",epochs=20):
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))]) 
    model.compile(loss=loss, optimizer="RMSProp")
    history = model.fit(xtrain, ytrain, epochs=epochs, verbose=False)
    print(f"Final loss value for {lossdescription}: {history.history['loss'][-1]:.4f}")

# Result from standard MSE loss ~ 0.01
cre_and_fit("mean_squared_error","Keras standard MSE")

# This gives the right result, not reducing. Return shape = (batch_size,)
cre_and_fit(lambda y_true, y_pred: (y_true-y_pred)*(y_true-y_pred),
               "custom loss, not reducing over batch items" )

# This also gives the right result, reducing over batch items. Return shape = ()
cre_and_fit(lambda y_true, y_pred: tf.reduce_mean((y_true-y_pred)*(y_true-y_pred) ),
               "custom loss, reducing over batch items")

# How about using axis=-1? Also gives the same result
cre_and_fit(lambda y_true, y_pred: tf.reduce_mean((y_true-y_pred)*(y_true-y_pred), axis=-1),
               "custom loss, reducing with axis=-1" )

bui · Accepted Answer · 2022-08-23T10:01:24.413

When you pass a lambda (or a callable in general) to compile and call fit, TF will wrap it inside a LossFunctionWrapper, which is a subclass of Loss, with a default reduction type of ReductionV2.AUTO. Note that a Loss object always has a reduction type representing how it will reduce the loss tensor to a single scalar.

Under most circumstances, ReductionV2.AUTO translates to ReductionV2.SUM_OVER_BATCH_SIZE which, despite its name, actually performs reduced mean over all axis on the underlying lambda's output.

import tensorflow as tf
from keras import losses as losses_mod
from keras.utils import losses_utils

a = tf.random.uniform((10,2))
b = tf.random.uniform((10,2))

l_auto = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred), reduction=losses_utils.ReductionV2.AUTO)
l_sum = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred), reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE)

l_auto(a,b).shape.rank == l_sum(a,b).shape.rank == 0  # rank 0 means scalar 
l_auto(a,b) == tf.reduce_mean(tf.square(a - b))  # True
l_sum(a,b) == tf.reduce_mean(tf.square(a - b))   # True

So to answer your question, the three options are equivalent since they all eventually result in a single scalar that is the mean of all elements in the raw tf.square(a - b) loss tensor. However, should you wish to perform an operation other than reduce_mean e.g., reduce_sum, in the lambda, then the three will yield different results:

l1 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.square(y_true - y_pred), 
                                    reduction=losses_utils.ReductionV2.AUTO)
l2 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.reduce_sum(tf.square(y_true - y_pred)), 
                                    reduction=losses_utils.ReductionV2.AUTO)
l3 = losses_mod.LossFunctionWrapper(fn=lambda y_true, y_pred : tf.reduce_sum(tf.square(y_true - y_pred), axis=-1), 
                                    reduction=losses_utils.ReductionV2.AUTO)
l1(a,b) == tf.reduce_mean(tf.square(a-b))  # True
l2(a,b) == tf.reduce_sum(tf.square(a-b))   # True
l3(a,b) == tf.reduce_mean(tf.reduce_sum(tf.square(a-b), axis=-1))   # True

Concretely, l2(a,b) == tf.reduce_mean(tf.reduce_sum(tf.square(a-b))), but that is just tf.reduce_sum(tf.square(a-b)) since mean of a scalar is itself.

Thank you, this is very helpful. I'll take from this, I think, that it's probably advisable to reduce to a scalar in the loss function, since that's the most explicit way of coding it, the way that least relies on behind-the-scenes tf default action. Explicit is better than implicit! — David Harris, Aug 24 '22 at 07:57
That's not always the case. When you have sample weights to pass to the wrapper (along side y_true and y_pred), the Loss base class will apply the sample weights to the lambda's output before applying reduction, so if your lambda already reduces the loss to a scalar, the sample weights will lose their meaning. The best practice imo is to reduce the loss inside the lambda until the loss shape matches with the sample weights shape. — bui, Aug 24 '22 at 08:21

When writing a custom loss function, should I use tf.reduce_mean, and if so how? Does it ever matter?

1 Answers1

Linked