18

First: I am only a few days into TensorFlow, so please bear with me.

I started out from the cifar10 tutorial code and I am now using a combination of convolutions and eigenvalue decompositions that breaks the symbolic differentiation. That is, the graph gets built, but upon calling train() the script halts with "No gradient defined for operation [...] (op type: SelfAdjointEig)". No surprise there.

The inputs to the subgraph in question are still only the input feature maps and the filters being used, and I have the formulas for the gradients at hand; they should be straightforward to implement given the inputs to the subgraph and the gradient with respect to its output.

From what I can see in the docs, I can register a gradient method for custom ops with RegisterGradient or override them with the experimental gradient_override_map. Both of those should give me access to exactly the things I need. For example, searching on GitHub I find a lot of examples that access the op's inputs as op.inputs[0] or such.
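For context, this is the kind of pattern I mean. Just a toy sketch: the "MyToyGrad" name and the 2*x formula are made up, I'd plug in my own gradient there.

import tensorflow as tf

# Toy: let tf.identity stand in for the forward op and give it a hand-written gradient.
@tf.RegisterGradient("MyToyGrad")
def _my_toy_grad(op, grad):
  x = op.inputs[0]          # the op's forward inputs are available here
  return grad * 2.0 * x     # whatever formula I derived by hand

g = tf.Graph()
with g.as_default():
  x = tf.constant(3.0)
  with g.gradient_override_map({"Identity": "MyToyGrad"}):
    y = tf.identity(x)      # forward: y == x; backward: uses MyToyGrad
  dy_dx = tf.gradients(y, x)[0]

with tf.Session(graph=g) as sess:
  print(sess.run(dy_dx))    # 6.0 instead of the usual 1.0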

The problem I have is that I want to essentially "shortcut" a whole subgraph, not a single op, so I have no single op to decorate. Since this is happening in one of the convolutional layers of the cifar example, I tried using the scope object for that layer. Conceptually, what enters and exits that scope's graph is exactly what I want, so if I could somehow override the gradients for the whole scope, that would already do it.

I saw tf.Graph.create_op, which (I think) I could use to register a new type of operation, and I could then override that operation type's gradient computation with the aforementioned methods. But I don't see a way of defining that op's forward pass without writing it in C++...

Maybe I am approaching this the wrong way entirely? Since all of my forward and backward operations can be implemented with the Python interface, I obviously want to avoid implementing anything in C++.

black_puppydog
  • Maybe you can override the gradient for a single op on top of your undifferentiable graph, and then use `tf.stop_gradient()` to prevent the gradient construction for that subgraph? http://stackoverflow.com/questions/33727935/how-to-use-stop-gradient-in-tensorflow – Yaroslav Bulatov Apr 06 '16 at 17:23
  • I can imagine locally defining a gradient function, then using the still in-scope inputs in that. But how would I tell tf which nodes' gradients I take as inputs to that gradient computation? This feels to me like I am fundamentally misusing the framework :P – black_puppydog Apr 07 '16 at 13:17

4 Answers

32

Here's a trick from Sergey Ioffe:

Suppose you want a group of ops that behaves as f(x) in the forward pass, but as g(x) in the backward pass. You can implement it as:

t = g(x)
y = t + tf.stop_gradient(f(x) - t)

So in your case, g(x) could be an identity op with a custom gradient attached via gradient_override_map.
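
Putting the pieces together, a rough sketch in graph-mode TF 1.x (the "CustomGrad" registration and the constant toy gradient are placeholders for your own backward formula):

import tensorflow as tf

@tf.RegisterGradient("CustomGrad")
def _custom_grad(op, grad):
  # op.inputs[0] is x; return the hand-derived gradient w.r.t. x here.
  return grad * 3.0                      # toy backward: pretend dg/dx == 3 everywhere

def f_fwd_g_bwd(x, f):
  g = tf.get_default_graph()
  with g.gradient_override_map({"Identity": "CustomGrad"}):
    t = tf.identity(x)                   # g(x): identity in the forward pass
  return t + tf.stop_gradient(f(x) - t)  # value == f(x), gradient comes from CustomGrad

x = tf.constant(2.0)
y = f_fwd_g_bwd(x, tf.square)
dy_dx = tf.gradients(y, x)[0]
with tf.Session() as sess:
  print(sess.run([y, dy_dx]))            # [4.0, 3.0]: forward is f, backward is custom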

Yaroslav Bulatov
  • 2
    For comprehension: the `stop_gradient` call takes care of the automatic gradient bit, overriding the gradient for `g` gives me the ability to insert my own, and the `t + f(x) - t` will be optimized away? – black_puppydog Apr 12 '16 at 16:26
  • 3
    Value of "t + f(x) - t" is equal to "f(x)". It's computationally equivalent in current version, but in future version it may be optimized away – Yaroslav Bulatov Apr 12 '16 at 16:40
  • 2
    I was finally able to apply this, albeit not for the same function after all. But this does not generalize well to "compound operations" with multiple inputs, because the "add-subtract" trick doesn't work there, does it? The best I could think of (but didn't have to try after all) was somehow using tuples instead of an identity op. But I am a bit unclear on how the graph would look afterwards. Anyway, huge thank you :) – black_puppydog Apr 25 '16 at 09:26
  • Exactly what I needed. Maybe this should be a built-in? – Paulo Costa May 23 '18 at 15:49
  • The given solution is neat if you assume that you can easily cancel out (e.g. via the subtraction inside stop_gradient) the effect that the backward pass has on the forward pass. However, suppose the forward function generates a set of random indices that are used to shuffle some features and/or labels used in the network/loss. In this case, a simple subtraction will not cancel the effect of calling the randomizer twice. How could we make the second (backward) call innocuous? – Peter Jun 12 '18 at 21:39
2

From TensorFlow 1.7 onward, tf.custom_gradient is the way to go.
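
A minimal sketch of the same forward-f/backward-g idea with tf.custom_gradient (TF 2.x eager mode here; the squared forward and the constant backward are just stand-ins for your own f and gradient of g):

import tensorflow as tf

@tf.custom_gradient
def f_fwd_g_bwd(x):
  y = tf.square(x)          # f(x): used for the forward value
  def grad(dy):
    return dy * 3.0         # hand-written backward, standing in for dg/dx
  return y, grad

x = tf.constant(2.0)
with tf.GradientTape() as tape:
  tape.watch(x)
  y = f_fwd_g_bwd(x)
print(y.numpy(), tape.gradient(y, x).numpy())   # 4.0 3.0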

Stephane Bersier
0

How about multiplying and dividing by t, instead of adding and subtracting it?

t = g(x)
y = tf.stop_gradient(f(x) / t) * t
  • 2
    The gradient flowing into t here is (f(x)/t)*dy, which is not what we wanted. Stopping the gradient through the left-hand factor doesn't prevent the derivative of the multiplication from using the forward result. – lahwran May 12 '17 at 16:42
0

Here is an approach that works for TensorFlow 2.0. Note that in 2.0 we have two different autodiff mechanisms: GradientTape for eager mode and tf.gradients for non-eager (graph) mode, here called "lazy". The example below demonstrates that tf.custom_gradient works both ways.

import tensorflow as tf
assert tf.version.VERSION.startswith('2.')
import numpy as np
from tensorflow.python.framework.ops import disable_eager_execution, enable_eager_execution
from tensorflow.python.client.session import Session

# Custom-gradient op: forward computes x*x, backward returns dy * 2x.
@tf.custom_gradient
def mysquare(x):
  res = x * x
  def _grad(dy):
    return dy * (2 * x)
  return res, _grad

def run_eager():
  enable_eager_execution()

  x = tf.constant(np.array([[1, 2, 3], [4, 5, 6]]).astype('float32'))
  with tf.GradientTape() as tape:
    tape.watch(x)                  # x is a constant, so it must be watched explicitly
    y = tf.reduce_sum(mysquare(x))

  dy_dx = tape.gradient(y, x)      # take the gradient outside the recording context
  print('Eager mode')
  print('x:\n', x.numpy())
  print('y:\n', y.numpy())
  print('dy_dx:\n', dy_dx.numpy())


def run_lazy():
  disable_eager_execution()        # build a graph and evaluate it in a Session instead

  x = tf.constant(np.array([[1, 2, 3], [4, 5, 6]]).astype('float32'))
  y = tf.reduce_sum(mysquare(x))
  dy_dx = tf.gradients(y, x)       # returns a list with one gradient per input

  with Session() as s:
    print('Lazy mode')
    print('x:\n', x.eval(session=s))
    print('y:\n', y.eval(session=s))
    assert len(dy_dx) == 1
    print('dy_dx:\n', dy_dx[0].eval(session=s))

if __name__ == '__main__':
  run_eager()
  run_lazy()
Grwlf