
Considering the example code,

I would like to know how to apply gradient clipping to this network, since the RNN creates the possibility of exploding gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

This is an example that could be used, but where do I introduce it? In the definition of the RNN?

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
    tf.clip_by_value(_X, -1, 1, name=None)

But this doesn't make sense, as the tensor _X is the input and not the gradient, which is what should be clipped?

Do I have to define my own optimizer for this, or is there a simpler option?

Arsenal Fanatic

8 Answers


Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.

In order to clip your gradients, you'll need to explicitly compute, clip, and apply them as described in this section of TensorFlow's API documentation. Specifically, you'll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
Styrke
  • Styrke, thanks for the post. Do you know what the next steps are to actually run an iteration of the optimizer? Typically, an optimizer is instantiated as `optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)` and then an iteration of the optimizer is done as `optimizer.run()`, but using `optimizer.run()` does not seem to work in this case? – applecider Apr 24 '16 at 03:57
  • Ok, got it: `optimizer.apply_gradients(capped_gvs)` needs to be assigned to something, e.g. `x = optimizer.apply_gradients(capped_gvs)`; then within your session you can train as `x.run(...)` – applecider Apr 24 '16 at 04:05
  • Shout-out to @remi-cuingnet for the [nice edit suggestion](http://stackoverflow.com/review/suggested-edits/12543496). (Which unfortunately was rejected by hasty reviewers) – Styrke Jun 01 '16 at 13:37
  • This gives me `UserWarning: Converting sparse IndexedSlices to a dense Tensor with 148331760 elements. This may consume a large amount of memory.` So somehow my sparse gradients are converted to dense. Any idea how to overcome this problem? – Pekka Sep 04 '16 at 08:17
  • @Pekka Take a look at the answers to [this question](http://stackoverflow.com/questions/35892412/tensorflow-dense-gradient-explanation). In this specific case I suspect that the warning is caused by `tf.clip_by_value()` not supporting sparse IndexedSlices, but I haven't checked. – Styrke Nov 11 '16 at 16:39
  • In case you have problems with `None` gradients, look here: http://stackoverflow.com/questions/39295136/gradient-clipping-appears-to-choke-on-none – patapouf_ai Feb 19 '17 at 20:54
  • @Styrke and others: I know this answer has 30 votes as of right now, but is this safe? The `AdamOptimizer` and other fancy optimizers can keep track of many things, like momentum. I would like to know if those features work normally while clipping gradients this way. This answer looks simple to me, and I fear that something might be missing. – Guillaume Chevalier Apr 01 '17 at 18:10
  • Actually the right way to clip gradients (according to tensorflow docs, computer scientists, and logic) is with `tf.clip_by_global_norm`, as suggested by @danijar – gdelab Jun 29 '17 at 07:40
  • If `grad` (the gradient) is `nan`, then NEITHER `tf.clip_by_value()` NOR `tf.clip_by_global_norm()` would work, right? – S.Perera Mar 01 '21 at 17:19
  • 404 on the link – John Glen Mar 21 '22 at 11:24

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

Clipping each gradient matrix individually changes their relative scale, but is also possible:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
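
For completeness, here is a minimal runnable sketch of the TF2 loop above; the tiny model, loss, and random data are hypothetical placeholders of mine, not part of the original answer:

import tensorflow as tf

# Hypothetical setup: a tiny regression model on random data,
# just to make the clipping step above concrete and runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(1e-3)

x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))

variables = model.trainable_variables
gradients = tape.gradient(loss, variables)
# Clip the whole gradient by its global norm, as in the answer.
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))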
danijar
  • how does `tf.clip_by_global_norm` handle `None` gradients? – gokul_uf Apr 30 '17 at 19:26
  • @gokul_uf It automatically ignores `None` gradients. – danijar May 03 '17 at 03:29
  • Good example with `clip_by_global_norm()`! This is also described as `the correct way to perform gradient clipping` in tensorflow docs: https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/clip_by_global_norm – MZHm Jun 14 '17 at 13:57
  • thanks! why 5.0 as the clip_norm? how did you choose that value? – Escachator Aug 23 '17 at 14:47
  • @Escachator It's empirical and will depend on your model and possibly the task. What I do is visualize the gradient norm `tf.global_norm(gradients)` to see its usual range and then clip a bit above that to prevent outliers from messing up the training. – danijar Aug 23 '17 at 16:02
  • would you still call `opt.minimize()` after or would you call something different like `opt.run()` as is suggested in some of the comments on other answers? – reese0106 Feb 05 '18 at 13:56
  • @reese0106 No, `optimizer.minimize(loss)` is just a shorthand for computing and applying the gradients. You can run the example in my answer with `sess.run(optimize)`. – danijar Feb 05 '18 at 14:03
  • So if I were using `tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)` within an experiment function, then your `optimize` would replace my `train_op`, correct? Right now my `train_op = optimizer.minimize(loss, global_step=global_step)`, so I'm trying to make sure I adjust accordingly... – reese0106 Feb 05 '18 at 16:30
  • @reese0106 Exactly! – danijar Feb 05 '18 at 17:30
  • Last question, how would you suggest that I incorporate the global_step if it is not done within .minimize()? – reese0106 Feb 08 '18 at 20:28
  • @danijar I think you still have to handle the None values. Optimizer.apply_gradients() throws the 'None values not supported error' when there are trainable variables that the loss is not dependent upon. Can somebody confirm? – figs_and_nuts Mar 13 '18 at 11:41
  • z = tf.get_variable(name = 'z', shape = [1]); b = tf.get_variable('b', [1]); c = b*b - 2*b + 1; optimizer = tf.train.AdamOptimizer(0.1); gradients, variables = zip(*optimizer.compute_gradients(c)); #gradients = tf.clip_by_global_norm(gradients, 2.5); train_op = optimizer.apply_gradients(zip(gradients, variables)); .... comment out the clipping statement to see the difference – figs_and_nuts Mar 13 '18 at 11:42
  • The Adam optimizer is reinitialized here every time the gradient is computed. So if it is in a training loop, then the gradient is not calculated as it should for Adam optimizer. The Adam optimizer should be initialized somewhere outside the code block. – Daniel Wiczew Apr 14 '20 at 16:14
  • It keeps the direction of the tensor. Good! – Eduardo Freitas Jul 08 '20 at 18:59

It's easy for tf.keras!

optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)

This optimizer will clip all gradients to values between [-1.0, 1.0].

See the docs.
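
As a quick sketch of how this plugs into the usual compile()/fit() workflow (the tiny model and random data here are hypothetical, just for illustration):

import tensorflow as tf

# The clipping happens inside the optimizer, so compile()/fit() need no changes.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
model.compile(optimizer=optimizer, loss='mse')

x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])
model.fit(x, y, epochs=1)  # every gradient is clipped to [-1.0, 1.0]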

Nicolas Gervais
  • Also, if we use custom training and use `optimizer.apply_gradients`, we need to clip the gradient before calling this method. In that case, we need `gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]` followed by `.apply_gradients`. – Innat Mar 07 '21 at 04:46
  • It also supports `clipnorm` and apparently `global_clipnorm`: `optimizer = tf.keras.optimizers.Adam(global_clipnorm=5.0)` – James Hirschorn Jun 17 '21 at 14:33

This is actually properly explained in the documentation:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:

  • Compute the gradients with compute_gradients().
  • Process the gradients as you wish.
  • Apply the processed gradients with apply_gradients().

And in the example they provide, they use these three steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
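
For instance, a minimal MyCapper could just wrap tf.clip_by_value (this particular implementation and its ±1 threshold are my own illustration, not from the docs):

def MyCapper(grad):
    # Pass through missing gradients; clip everything else elementwise.
    if grad is None:
        return None
    return tf.clip_by_value(grad, -1.0, 1.0)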

Salvador Dali
  • would you still call `opt.minimize()` after or would you call something different like `opt.run()` as is suggested in some of the comments on other answers? – reese0106 Jan 26 '18 at 14:07
  • @reese0106 No, you need to assign the `opt.apply_gradients(...)` to a variable like `train_step`, for example (just like you would for `opt.minimize()`). And then in your main loop you call it like usual to train: `sess.run([train_step, ...], feed_dict)` – dsalaj Aug 16 '18 at 07:15
  • Keep in mind that the gradient is defined as the vector of derivatives of the loss with respect to all parameters in the model. TensorFlow represents it as a Python list that contains a tuple for each variable and its gradient. This means to clip the gradient norm, you cannot clip each tensor individually, you need to consider the list at once (e.g. using `tf.clip_by_global_norm(list_of_tensors)`). – danijar Apr 14 '20 at 16:27
  • 404 on the link – John Glen Mar 21 '22 at 11:25

For those who would like to understand the idea of gradient clipping (by norm):

Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.

Let the gradient be g and the max_norm_threshold be j.

Now, if ||g|| > j, we do:

g = (j * g) / ||g||

This is what tf.clip_by_norm implements.
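
As a quick numerical sanity check of that formula (the example gradient is made up):

import tensorflow as tf

g = tf.constant([3.0, 4.0])  # ||g|| = 5
j = 2.5                      # max_norm_threshold

clipped = tf.clip_by_norm(g, j)
manual = (j * g) / tf.norm(g)
print(clipped.numpy(), manual.numpy())  # both give [1.5 2. ]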

kmario23
  • if I need to select the threshold by hand, are there any common methods to do this? – ningyuwhut Jun 20 '18 at 09:58
  • This is sort of a black magic suggested in some papers. Otherwise, you have to do a lot of experiments and find out which one works better. – kmario23 Jun 20 '18 at 13:35

IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

This way you only have to define it once, and not run it after every gradient calculation.

Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm

Ido Cohn

Gradient clipping basically helps in the case of exploding or vanishing gradients. Say your loss is too high, which results in exponentially large gradients flowing through the network, and this may lead to NaN values. To overcome this, we clip the gradients within a specific range (-1 to 1, or any range as per the condition).

clipped_value = [(tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars]

where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients) and the variables they will be applied to.

After clipping, we simply apply the clipped values using the optimizer: optimizer.apply_gradients(clipped_value)
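
Putting the pieces together, a minimal sketch of the full flow in the TF1 style of this answer (learning_rate, loss, and the ±1 range are placeholders):

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
# clip each gradient to [-1, 1] while keeping its variable pairing
clipped_value = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in grads_and_vars]
train_op = optimizer.apply_gradients(clipped_value)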

Raj

Method 1

If you are training your model using a custom training loop, then one update step will look like this:

# loop over the full dataset
# x -> training samples
# y -> labels
optimizer = tf.keras.optimizers.Adam()
for x, y in train_Data:
    with tf.GradientTape() as tape:
        prob = model(x, training=True)
        # calculate loss
        train_loss_value = loss_fn(y, prob)

    # get gradients
    gradients = tape.gradient(train_loss_value, model.trainable_weights)
    # clip gradients if you want to clip by norm
    gradients = [tf.clip_by_norm(grad, clip_norm=1.0) for grad in gradients]
    # or clip gradients by value
    gradients = [tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0) for grad in gradients]
    # apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))

Method 2

Or you could simply replace the first line in the above code as below:

# for clipping by norm
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# for clipping by value
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)

The second method will also work if you are using the model.compile -> model.fit pipeline.