54

I would like to replace or modify the gradient of an op or a portion of the graph in TensorFlow. Ideally, I could use the existing gradient in the calculation.

In some ways this is the opposite to what tf.stop_gradient() does: instead of adding a calculation which is ignored when calculating gradients, I want a calculation which is only used when calculating gradients.

A simple example would be something which simply scales gradients by multiplying them with a constant (but does not multiply the forward calculation by a constant). Another example would be something which clips the gradients to a given range.

Alex I

6 Answers

60

For TensorFlow 1.7 and TensorFlow 2.0, see the edit below.


First define your custom gradient:

@tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
  return 5.0 * grad

Since you want nothing to happen in the forward pass, override the gradient of an identity operation with your new gradient:

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
  output = tf.identity(input, name="Identity")

Here is a working example with a layer that clips gradients in the backwards pass and does nothing in the forwards pass, using the same method:

import tensorflow as tf

@tf.RegisterGradient("CustomClipGrad")
def _clip_grad(unused_op, grad):
  return tf.clip_by_value(grad, -0.1, 0.1)

input = tf.Variable([3.0], dtype=tf.float32)

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomClipGrad"}):
  output_clip = tf.identity(input, name="Identity")
grad_clip = tf.gradients(output_clip, input)

# output without gradient clipping in the backwards pass for comparison:
output = tf.identity(input)
grad = tf.gradients(output, input)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print("with clipping:", sess.run(grad_clip)[0])
  print("without clipping:", sess.run(grad)[0])

Edit for TensorFlow 1.7 and TensorFlow 2.0

Since 1.7 there is a new way to redefine the gradient with a shorter syntax, which also works with TensorFlow 2.0. It also allows you to redefine the gradient of multiple operations at the same time. Here are the examples from above, rewritten for TensorFlow 1.7 and TensorFlow 2.0:

Layer that scales gradients in the backward pass:

@tf.custom_gradient
def scale_grad_layer(x):
  def grad(dy):
    return 5.0 * dy
  return tf.identity(x), grad

Example with a layer that clips gradients in the backward pass:

@tf.custom_gradient
def clip_grad_layer(x):
  def grad(dy):
    return tf.clip_by_value(dy, -0.1, 0.1)
  return tf.identity(x), grad
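
A quick way to check both layers in TensorFlow 2 is to differentiate through them with tf.GradientTape. This is only a sketch, assuming eager execution; the input value 3.0 is reused from the earlier example and the variable names are illustrative.

```python
import tensorflow as tf

@tf.custom_gradient
def scale_grad_layer(x):
    def grad(dy):
        return 5.0 * dy                      # scale the incoming gradient
    return tf.identity(x), grad              # forward pass is unchanged

@tf.custom_gradient
def clip_grad_layer(x):
    def grad(dy):
        return tf.clip_by_value(dy, -0.1, 0.1)  # clip the incoming gradient
    return tf.identity(x), grad

x = tf.constant([3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)                            # constants are not watched by default
    y_scaled = scale_grad_layer(x)
    y_clipped = clip_grad_layer(x)
    y_plain = tf.identity(x)                 # reference without a custom gradient

g_scaled = tape.gradient(y_scaled, x).numpy()
g_clipped = tape.gradient(y_clipped, x).numpy()
g_plain = tape.gradient(y_plain, x).numpy()
print("scaled:", g_scaled, "clipped:", g_clipped, "plain:", g_plain)
```

The forward values are all equal to the input; only the gradients differ (5.0, 0.1 and 1.0 respectively).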
BlueSun
  • Would this modify later gradients in the chain as well or no? – Kevin P Mar 09 '18 at 16:57
  • @KevinP For the clipping example: the gradients are clipped only once, during the backward pass of the identity operation. All earlier layers in the chain are still affected, because every layer uses the gradients of the following layer in its own backward pass, but those earlier layers themselves do not clip again. – BlueSun Mar 09 '18 at 18:58
  • Thanks. The whole backprop vs. forward made the question more confusing than intended. I did mean later in the backprop gradient chain. – Kevin P Mar 09 '18 at 19:07
  • Can `grad` accept other parameters, e.g., some intermediate variables for reducing computation? – huangbiubiu Aug 10 '18 at 02:48
  • @HuangYuheng grad cannot accept additional parameters, but it can use TensorFlow variables, which in turn can be changed in the forward pass. – BlueSun Aug 22 '18 at 09:59
18

Assuming the forward computation is

y = f(x)

And you want it to backpropagate like

y = b(x)

A simple hack will be:

y = b(x) + tf.stop_gradient(f(x) - b(x))
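
As an illustration of this trick, here is a small sketch for the gradient-scaling case from the question, with f(x) = x in the forward pass and b(x) = scale * x in the backward pass. The helper name scale_gradient is hypothetical, and the snippet assumes TensorFlow 2 with eager execution.

```python
import tensorflow as tf

def scale_gradient(x, scale):
    """Hypothetical helper: forward value is x, gradient is scale * dy."""
    backward = scale * x   # b(x): the expression backprop differentiates
    forward = x            # f(x): the value we actually want to output
    # The stop_gradient term corrects the value but contributes no gradient.
    return backward + tf.stop_gradient(forward - backward)

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = scale_gradient(x, 5.0)
    loss = tf.reduce_sum(y)

grad = tape.gradient(loss, x)
print("forward:", y.numpy())      # same values as x
print("gradient:", grad.numpy())  # 5.0 for every element
```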
Bily
17

Use optimizer.compute_gradients or tf.gradients to get the original gradients, then modify them however you want, and finally apply them with optimizer.apply_gradients.

I found an example on GitHub.
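
The three steps can be sketched roughly as follows. This is an assumption-laden illustration rather than the linked example: it uses the tf.compat.v1 gradient descent optimizer under TensorFlow 2 eager execution (where compute_gradients expects a callable loss), and the variable w, the loss and the clipping range are made up for the demo.

```python
import tensorflow as tf

w = tf.Variable([1.0, -2.0])
opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)

def loss_fn():
    # Illustrative loss with gradient 20 * w.
    return tf.reduce_sum(10.0 * w * w)

# 1. Get the original gradients as (gradient, variable) pairs.
grads_and_vars = opt.compute_gradients(loss_fn, var_list=[w])

# 2. Modify them: here every gradient is clipped to [-1, 1].
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v) for g, v in grads_and_vars]

# 3. Apply the modified gradients.
opt.apply_gradients(clipped)
print(w.numpy())
```

Note that this changes the final gradient of each variable, not the gradient flowing through an individual op inside the graph.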

xxi
  • Thank you, this is interesting. I think it replaces the complete (end-to-end) gradients though, and only for the optimizer. I want to replace the gradient of a single op, while letting gradients from other ops propagate through it the way they normally would; I don't necessarily know what to do to the end-to-end gradient. An example would be a tf.matmul() where the forward calculation is done normally, but the gradient is clip(grad, min, max), where grad is the original gradient, and have that be used in a larger graph. – Alex I May 12 '17 at 05:49
  • Take a look at [compute_gradients](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer#compute_gradients): it returns a list of `(gradient, variable)` pairs, so I think you can modify only the gradient you want, like [this](https://github.com/KelvinLu/krotos-convnet/blob/e37218aeaf10b73d77dfac911be46d8ab689e41d/krotos/convnet/model/training.py#L27); just find the `var` you want. – xxi May 12 '17 at 06:20
9

The most general way to do that is by using https://www.tensorflow.org/api_docs/python/tf/RegisterGradient

Below, I implemented backpropagated gradient clipping, which can be used with matmul, as shown here, or any other op:

import tensorflow as tf
import numpy as np

# from https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):

    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, int(1e8)))

    tf.RegisterGradient(rnd_name)(grad)
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)

def clip_grad(x, clip_value, name=None):
    """
    scales backpropagated gradient so that
    its L2 norm is no more than `clip_value`
    """
    with tf.name_scope(name, "ClipGrad", [x]) as name:
        return py_func(lambda x : x,
                        [x],
                        [tf.float32],
                        name=name,
                        grad=lambda op, g : tf.clip_by_norm(g, clip_value))[0]

Example usage:

with tf.Session() as sess:
    x = tf.constant([[1., 2.], [3., 4.]])
    y = tf.constant([[1., 2.], [3., 4.]])

    print('without clipping')
    z = tf.matmul(x, y)
    print(tf.gradients(tf.reduce_sum(z), x)[0].eval())

    print('with clipping')
    z = tf.matmul(clip_grad(x, 1.0), clip_grad(y, 0.5))
    print(tf.gradients(tf.reduce_sum(z), x)[0].eval())

    print('with clipping between matmuls')
    z = tf.matmul(clip_grad(tf.matmul(x, y), 1.0), y)
    print(tf.gradients(tf.reduce_sum(z), x)[0].eval())

Output:

without clipping
[[ 3.  7.]
 [ 3.  7.]]
with clipping
[[ 0.278543   0.6499337]
 [ 0.278543   0.6499337]]
with clipping between matmuls
[[ 1.57841039  3.43536377]
 [ 1.57841039  3.43536377]]
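
Since tf.py_func and gradient_override_map are TensorFlow 1.x graph-mode APIs, here is a sketch of how the same clip_grad idea might look in TensorFlow 2 using tf.custom_gradient. It is an adaptation, not the code above: the inner function name _identity is illustrative and eager execution is assumed.

```python
import tensorflow as tf

def clip_grad(x, clip_value):
    """Identity in the forward pass; clips the gradient's L2 norm on the way back."""
    @tf.custom_gradient
    def _identity(t):
        def grad(dy):
            return tf.clip_by_norm(dy, clip_value)
        return tf.identity(t), grad
    return _identity(x)

x = tf.constant([[1., 2.], [3., 4.]])
y = tf.constant([[1., 2.], [3., 4.]])

with tf.GradientTape() as tape:
    tape.watch(x)
    z = tf.matmul(clip_grad(x, 1.0), clip_grad(y, 0.5))
    loss = tf.reduce_sum(z)

g = tape.gradient(loss, x).numpy()
print(g)
```

For these inputs the gradient with respect to x should agree with the "with clipping" output shown above.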
MWB
  • MaxB: Thank you! This looks useful. I'm not sure how to define a new op in Python though... is it just a function with a decorator? Could you do a full example of matmul with clipped gradients? – Alex I May 13 '17 at 01:10
  • @AlexI It's not easy, but it's doable: http://stackoverflow.com/questions/37924071/tensorflow-writing-an-op-in-python If you just want to clip the gradients, I suggest you define an "identity op" that does nothing else but clip the gradient. Also, see https://www.tensorflow.org/extend/adding_an_op#implement_the_gradient_in_python – MWB May 13 '17 at 02:38
  • @AlexI I implemented actual backpropagated gradient clipping. See edit – MWB May 13 '17 at 09:02
2

For TensorFlow 2, you should use the tf.custom_gradient decorator as follows:

@tf.custom_gradient
def func(x):
    f = ...  # placeholder: compute the forward pass here
    def grad(dy):
        gradient = ...  # placeholder: compute the local gradient of func here
        return dy * gradient
    return f, grad
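
One concrete instance of this template, adapted from the example in the tf.custom_gradient documentation, is log(1 + exp(x)) with a hand-written gradient. The sketch below assumes eager execution; the factor 3.0 is only there to show that the upstream gradient dy is passed in and multiplied.

```python
import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad(dy):
        # Local gradient of log(1 + e^x) is 1 - 1 / (1 + e^x);
        # multiply by the upstream gradient dy (chain rule).
        return dy * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad

x = tf.constant(0.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = 3.0 * log1pexp(x)   # upstream gradient reaching grad() is 3.0

g = tape.gradient(y, x)
print(g.numpy())            # 3.0 * 0.5 = 1.5
```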

Note that you must multiply your gradient by the upstream gradient dy.

Be wary, though: if you call this as a plain function when creating a Keras functional model and use tf.GradientTape, then automatic differentiation will still take place, and your custom gradient will be ignored.

Instead, you must put your function into a layer:

class func_layer(tf.keras.layers.Layer):
    def __init__(self):
        super(func_layer, self).__init__()

    def call(self, x):
        return func(x)

Now, when you add a func_layer to your functional model, the backward pass will be calculated appropriately.

Alex Trevithick
1

For current TensorFlow r1.13, use tf.custom_gradient.

The decorated function (whose input arguments form a list x) should return

  • the result of the forward pass, and
  • a function which returns a list of gradients, one for each element in x.

Here's an example with one variable:

@tf.custom_gradient
def non_differentiable(x):
    f = tf.cast(x > 0, tf.float32)
    def grad(dy):
        # Multiply the surrogate local gradient by the upstream gradient dy.
        return dy * tf.math.maximum(0., 1 - tf.abs(x))
    return f, grad

And one with two:

@tf.custom_gradient
def non_differentiable2(x0, x1):
    f = x0 * tf.cast(x1 > 0, tf.float32)
    def grad(dy):
        df_dx0 = tf.cast(x1 > 0, tf.float32)
        return dy*df_dx0, tf.zeros_like(dy)
    return f, grad
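
To see the two-input version in action, one could evaluate it under tf.GradientTape as sketched below. The input values are made up for illustration, and TensorFlow 2 eager execution is assumed.

```python
import tensorflow as tf

@tf.custom_gradient
def non_differentiable2(x0, x1):
    f = x0 * tf.cast(x1 > 0, tf.float32)
    def grad(dy):
        df_dx0 = tf.cast(x1 > 0, tf.float32)
        # One gradient per input: real gradient for x0, zeros for x1.
        return dy * df_dx0, tf.zeros_like(dy)
    return f, grad

x0 = tf.constant([2.0, 3.0])
x1 = tf.constant([1.0, -1.0])
with tf.GradientTape() as tape:
    tape.watch([x0, x1])
    f = non_differentiable2(x0, x1)

g0, g1 = tape.gradient(f, [x0, x1])
print(f.numpy(), g0.numpy(), g1.numpy())
```

The gradient with respect to x0 is 1 where x1 is positive and 0 elsewhere, while the gradient with respect to x1 is zero everywhere, as defined in grad.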
cheersmate
  • Hi cheersmate, thank you for the answer. Do you know how to change the gradient for the relu function? – layser Aug 20 '19 at 12:18