
I am trying my own implementation of the DQN paper by DeepMind in TensorFlow, and I am running into difficulty with clipping of the loss function.

Here is an excerpt from the nature paper describing the loss clipping:

We also found it helpful to clip the error term from the update to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.

(link to full paper: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)

What I have tried so far is using

clipped_loss_vec = tf.clip_by_value(loss, -1, 1)

to clip the loss I calculate between -1 and +1. The agent is not learning the proper policy in this case. I printed out the gradients of the network and realized that if the loss falls below -1, the gradients all suddenly turn to 0!

My reasoning for this happening is that the clipped loss is a constant function in (-inf,-1) U (1,inf), which means it has zero gradient in those regions. This in turn ensures that the gradients throughout the network are zero (think of it as, whatever input image I provide the network, the loss stays at -1 in the local neighborhood because it has been clipped).
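To make this concrete, here is a minimal repro sketch (TF1-style, not my full network) demonstrating the zero gradient:

import tensorflow as tf

x = tf.Variable(3.0)
loss = tf.clip_by_value(tf.square(x), -1.0, 1.0)  # square(3.0) = 9.0 is clipped to 1.0
grad = tf.gradients(loss, x)[0]                   # gradient through the clipped region

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # prints 0.0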

So, my question is two parts:

  1. What exactly did DeepMind mean in the excerpt? Did they mean that the loss below -1 is clipped to -1 and the loss above +1 is clipped to +1? If so, how did they deal with the gradients (i.e., what is all that part about absolute value functions)?

  2. How should I implement loss clipping in TensorFlow such that the gradients do not go to zero outside the clipped range (but maybe stay at +1 and -1)? Thanks!

aphdstudent
  • fleibfried's answer is correct. However, several DQN implementations out there do clip the loss from -1 to 1. This works because the game rewards are also clipped from -1 to 1, which alleviates the issue. – BlueMoon93 Nov 23 '16 at 16:59

4 Answers


I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.

Equivalently: Define a function f as follows:

f(x) = x^2          if x in [-0.5,0.5]
f(x) = |x| - 0.25   if x < -0.5 or x > 0.5

Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value |s|.

Notice that f has the nice property that its derivative is always in the range [-1,1]:

f'(x) = 2x    if x in [-0.5,0.5]
f'(x) = +1    if x > +0.5
f'(x) = -1    if x < -0.5

Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared-loss and then clipping it.

Thus, what they're doing is effectively replacing a squared-loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.

Now the point is that the following two alternatives are equivalent:

  • Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.

  • Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it directly (unchanged) in the gradient descent.

The former is easy to implement. The latter has nice properties (it improves stability; it's better than an absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of squared loss with the stability and robustness of the Huber loss.
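For concreteness, here is a minimal TF1-style sketch of the former alternative (assuming loss and learning_rate are defined elsewhere; the exact optimizer is not specified in the paper excerpt):

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
# clip each gradient component to [-1, 1] before the update step
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)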

D.W.
  • I think the two alternatives have a subtle but important difference. When using a DQN, the weights w are updated by taking the gradient of the loss. So for the Huber loss, the gradient outside [-0.5,0.5] would be either dQ(s,a)/dw or -dQ(s,a)/dw, because the derivative of the absolute-value part is +1 or -1. Using a squared loss function and clipping the gradient means that the whole resulting gradient is limited to [-1,+1], and not only the error-term factor. Thus, to get results similar to the DQN paper, I think one should use the Huber loss. – RaviTej310 Nov 02 '20 at 02:53

First of all, the code for the paper is available online, which constitutes an invaluable reference.

Part 1

If you take a look at the code you will see that, in nql:getQUpdate (NeuralQLearner.lua, line 180), they clip the error term of the Q-learning function:

-- delta = r + (1-terminal) * gamma * max_a Q(s2, a) - Q(s, a)
if self.clip_delta then
    delta[delta:ge(self.clip_delta)] = self.clip_delta
    delta[delta:le(-self.clip_delta)] = -self.clip_delta
end

Part 2

In TensorFlow, assuming the last layer of your neural network is called self.output, self.actions is a one-hot encoding of all actions, self.q_targets_ is a placeholder with the targets, and self.q is your computed Q:

# The loss function
one = tf.constant(1.0)
delta = self.q - self.q_targets_
absolute_delta = tf.abs(delta)
delta = tf.where(
    absolute_delta < one,
    tf.square(delta),
    tf.ones_like(delta) # squared error: (-1)^2 = 1
)

Or, using tf.clip_by_value (and having an implementation closer to the original):

delta = tf.clip_by_value(
    self.q - self.q_targets_,
    -1.0,
    +1.0
)
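As the comments below suggest, the Huber loss is arguably the cleaner route. For reference, recent TF 1.x releases ship one built in; a sketch assuming the same names as above:

loss = tf.losses.huber_loss(self.q_targets_, self.q, delta=1.0)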
renatoc
  • Seeing a very recent TF issue: https://github.com/tensorflow/tfjs/issues/338, tf.where cannot come into play since it does not flow the gradient. I personally applied tf.clip_by_value as well, but that does not work either. – sdr2002 May 27 '18 at 23:44
  • From the issue, it seems like this is fixed now. Still, the Huber loss seems to be the correct way to implement this. – renatoc Apr 22 '19 at 17:26
  1. No. They actually talk about error clipping, not loss clipping, which as far as I know refers to the same thing but leads to confusion. They DO NOT mean that the loss below -1 is clipped to -1 and the loss above +1 is clipped to +1, because that leads to zero gradients outside the error range [-1,1], as you realized. Instead, they suggest using a linear loss instead of a quadratic loss for error values < -1 and error values > 1.

  2. Compute the error value (r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i)). If this error value is within the range [-1,1], square it; if it is < -1, multiply it by -1; if it is > 1, leave it as it is. If you use this as the loss function, the gradients outside the interval [-1,1] won't vanish.

In order to have a "smooth-looking" compound loss function, you could also replace the squared loss outside the error range [-1,1] with a first-order Taylor approximation at the border values -1 and 1. In this case, if e is your error value, you would square it if e \in [-1,1]; if e < -1, replace it by -2e-1; and if e > 1, replace it by 2e-1.
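A minimal TensorFlow sketch of this Taylor-smoothed loss (the name error is a placeholder for the error value defined in point 2, and the function name is my own):

import tensorflow as tf

def smoothed_clipped_loss(error):
    squared = tf.square(error)           # e^2 for e in [-1, 1]
    linear = 2.0 * tf.abs(error) - 1.0   # Taylor pieces: -2e-1 (e < -1) and 2e-1 (e > 1)
    return tf.where(tf.abs(error) <= 1.0, squared, linear)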

fleibfried
  1. In the DeepMind paper you reference, they limit the gradient of the loss. This prevents giant gradients and so improves robustness. They do this by using a quadratic loss function for errors inside a small range, and an absolute-value loss for larger errors.
  2. I suggest implementing the Huber loss function. Below is a Python TensorFlow implementation.

    import tensorflow as tf

    def huber_loss(y_true, y_pred, max_grad=1.):
        """Calculates the huber loss.
    
        Parameters
        ----------
        y_true: np.array, tf.Tensor
          Target value.
        y_pred: np.array, tf.Tensor
          Predicted value.
        max_grad: float, optional
          Positive floating point value. Represents the maximum possible
          gradient magnitude.
    
        Returns
        -------
        tf.Tensor
          The huber loss.
        """
        err = tf.abs(y_true - y_pred, name='abs')
        mg = tf.constant(max_grad, name='max_grad')
    
        lin = mg * (err - .5 * mg)
        quad = .5 * err * err
    
        return tf.where(err < mg, quad, lin)
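
A hypothetical usage inside a DQN training graph (the names q_targets and q_predicted and the RMSProp hyperparameters are assumptions, not part of the answer):

    loss = tf.reduce_mean(huber_loss(q_targets, q_predicted, max_grad=1.0))
    train_op = tf.train.RMSPropOptimizer(learning_rate=0.00025).minimize(loss)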
    
Brad Saund