I am attempting to compute the Jacobian of a TensorFlow neural network's outputs with respect to its inputs. This is easily achieved with the tf.GradientTape.jacobian
method. The trivial example provided in the TensorFlow documentation is as follows:
import tensorflow as tf

with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x
jacobian = g.jacobian(y, x)  # [[2., 0.], [0., 4.]]
This is fine if I only want to compute the Jacobian for a single instance of the input tensor x. However, I need to evaluate this Jacobian many, many times for different instances of x. For a non-trivial Jacobian (e.g. of a deep convolutional neural network with non-linear activation functions), repeatedly rerunning the GradientTape computation and the jacobian method is incredibly expensive.
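Concretely, the pattern I mean looks roughly like the sketch below; model, jacobian_at, and the random inputs are just placeholders standing in for my actual network and evaluation loop.

import tensorflow as tf

# "model" is a stand-in for the actual network (mine is a deep CNN);
# a small dense model keeps the sketch self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),
])

def jacobian_at(x):
    # A fresh tape is recorded and differentiated on every call;
    # this is the part that becomes expensive when repeated many times.
    with tf.GradientTape() as g:
        g.watch(x)
        y = model(x)
    return g.jacobian(y, x)

for _ in range(1000):
    x = tf.random.uniform((1, 4))
    jac = jacobian_at(x)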
I know from the TensorFlow documentation that the gradients (and hence the Jacobian) are computed via automatic differentiation. I have to imagine there is some internal representation of the analytical gradient of the network, built by automatic differentiation, which is then evaluated at the given inputs.
My question: am I correct in assuming that TensorFlow builds and stores (at least parts of) the analytical gradients needed to compute the Jacobian? And if so, is there a way to save this analytical gradient and re-evaluate the Jacobian with new inputs without having to reconstruct it via the GradientTape method?
A "persistent" GradientTape does not seem to solve this issue: it only allows for the repeated evaluation of a single GradientTape instance with respect to multiple internal arguments of the computation.