
SOLUTION BELOW:

Scenario:

I am trying to compute the Jacobian of a user-defined function many, many times in a loop. I am able to do this with TF 2's GradientTape as well as the older session-based tf.gradients() method. The problem is that GradientTape is terribly slow (100x slower) than tf.gradients(). It has features I'd like to use (batch_jacobian, hessian support, etc.), but if it's 100x slower then I can't use it.

The Question:

It's not clear to me if I'm simply misusing GradientTape, or if it will always be slower because it has to re-differentiate the provided function every time it's called (my suspicion). I'm asking for tips to fix my use of GradientTape, or for confirmation that it will always be fundamentally slower than tf.gradients() by orders of magnitude.

Fully contained minimal example comparing GradientTape and tf.gradients():

import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
import numpy as np
# from tensorflow.python.ops.parallel_for.gradients import jacobian, batch_jacobian
import timeit


class FunctionCaller(object):
    def __init__(self, func, nX, dtype=tf.float64, useSessions=True):

        if useSessions:
            disable_eager_execution()

        self.func = func
        self.nX = nX
        self.useSessions = useSessions
        self.dtype = dtype
        self.sess = tf.compat.v1.Session() if useSessions else None

        if not useSessions:
            return

        #
        # we are in session mode, so build the graph and take the batch-jacobian of the function's outputs
        #
        xTensor = tf.compat.v1.placeholder(dtype, shape=[None, nX])

        # add function to graph and guarantee its output shape
        func_tensor = tf.reshape(func(xTensor), [-1, nX])

        # take the gradient for each output, one at a time, and stack the results back together
        each_output = tf.unstack(func_tensor, nX, axis=1)

        jac_x = tf.stack([tf.gradients(output, xTensor, unconnected_gradients='zero')[0]
                          for output in each_output], axis=1)

        # record these tensors so we can use them later with session.run()
        self.xTensor = xTensor
        self.func_tensor = func_tensor
        self.jac_func_tensor = jac_x

    def jac(self, x_i):
        if self.useSessions:
            return self.sess.run(self.jac_func_tensor, {self.xTensor: x_i})
        else:
            return self._useGradientTape(x_i)

    # THIS FUNCTION IS SUPER INEFFICIENT.
    def _useGradientTape(self, x_i):
        with tf.GradientTape(persistent=True) as g:
            xTensor = tf.Variable(x_i, dtype=self.dtype)  # is this my problem??? I recreate x every time?
            y = tf.reshape(self.func(xTensor), [-1, self.nX])
        jac_x_at_i = g.batch_jacobian(y, xTensor)
        # del g
        return jac_x_at_i.numpy()

    def __del__(self):
        if self.sess is not None:
            self.sess.close()


def main():
    @tf.function
    def Xdot(x_i):
        x_0, x_1, x_2 = tf.split(x_i, 3, axis=1)
        return tf.concat([x_2 * tf.sin(x_2), x_2 * tf.cos(x_2), x_2], axis=1)

    nT = 20
    nX = 3

    # create some trash data
    x_i = np.arange(nT*nX).reshape([-1, nX])

    nTrials = 100

    # try the eager version first
    caller_eager = FunctionCaller(Xdot, nX, useSessions=False)
    start_time = timeit.default_timer()
    for _ in range(nTrials):
        jac_eager = caller_eager.jac(x_i)
    elapsed = timeit.default_timer() - start_time
    print("eager code took {} sec: {} sec/trial".format(elapsed, elapsed/nTrials))

    # now try the sessions version
    caller_sessions = FunctionCaller(Xdot, nX, useSessions=True)
    start_time = timeit.default_timer()
    caller_sessions.jac(x_i)  # call it once to do its graph building stuff?
    for _ in range(nTrials):
        jac_session = caller_sessions.jac(x_i)
    elapsed = timeit.default_timer() - start_time
    print("session code took {} sec: {} sec/trial".format(elapsed, elapsed/nTrials))

    residual = np.max(np.abs(jac_eager - jac_session))
    print('residual between eager and session trials is {}'.format(residual))

if __name__ == "__main__":
    main()

EDIT - SOLUTION:

xdurch0 pointed out below that I should wrap _useGradientTape() in a @tf.function, something I had been unsuccessful with before for other reasons. Once I did that, I had to move xTensor's definition outside the @tf.function wrapper by making it a member variable and updating it with assign().

With all this done, I find that GradientTape (for this simple example) is now on the same order of magnitude as tf.gradients(). When running enough trials (~1E5), it's twice as fast as tf.gradients(). Awesome!
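As an aside (not something I benchmarked in the runs below): if the member-variable-plus-assign() workaround feels awkward, it should also be possible to watch the input tensor directly with GradientTape.watch() inside a @tf.function that has an input_signature, which avoids both retracing and the pre-sized tf.Variable. A minimal sketch of that variant; the name batch_jac and the reuse of Xdot here are just for illustration:

import numpy as np
import tensorflow as tf

nX = 3

def Xdot(x):
    # same toy dynamics as the example above
    _, _, x_2 = tf.split(x, 3, axis=1)
    return tf.concat([x_2 * tf.sin(x_2), x_2 * tf.cos(x_2), x_2], axis=1)

# fixing the input_signature keeps tf.function from retracing for different batch sizes
@tf.function(input_signature=[tf.TensorSpec(shape=[None, nX], dtype=tf.float64)])
def batch_jac(x):
    with tf.GradientTape() as g:
        g.watch(x)  # watch the plain tensor; no tf.Variable or assign() needed
        y = tf.reshape(Xdot(x), [-1, nX])
    return g.batch_jacobian(y, x)

x_i = np.random.random([20, nX])
print(batch_jac(x_i).shape)  # (20, 3, 3)

The full working code I actually timed is below.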

import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
import numpy as np
import timeit


class FunctionCaller(object):
    def __init__(self, func, nT, nX, dtype=tf.float64, useSessions=True):

        if useSessions:
            disable_eager_execution()

        self.func = func
        self.nX = nX
        self.useSessions = useSessions
        self.dtype = dtype
        self.sess = tf.compat.v1.Session() if useSessions else None

        if not useSessions:
            #  you should be able to create without an initial value, but tf is demanding one
            #  despite what the docs say. bug?
            #  tf.Variable(initial_value=None, shape=[None, nX], validate_shape=False, dtype=self.dtype)
            self.xTensor = tf.Variable([[0]*nX]*nT, dtype=self.dtype)  # x needs to be properly sized once
            return

        #
        # we are in session mode, so build the graph and take the batch-jacobian of the function's outputs
        #
        xTensor = tf.compat.v1.placeholder(dtype, shape=[None, nX])

        # add function to graph and guarantee its output shape
        func_tensor = tf.reshape(func(xTensor), [-1, nX])

        # take the gradient for each output, one at a time, and stack the results back together
        each_output = tf.unstack(func_tensor, nX, axis=1)

        jac_x = tf.stack([tf.gradients(output, xTensor, unconnected_gradients='zero')[0]
                          for output in each_output], axis=1)

        # record these tensors so we can use them later with session.run()
        self.xTensor = xTensor
        self.func_tensor = func_tensor
        self.jac_func_tensor = jac_x

    def jac(self, x_i):
        if self.useSessions:
            return self.sess.run(self.jac_func_tensor, {self.xTensor: x_i})
        else:
            return self._useGradientTape(x_i).numpy()

    @tf.function  # THIS IS CRUCIAL
    def _useGradientTape(self, x_i):
        with tf.GradientTape(persistent=True) as g:
            self.xTensor.assign(x_i)  # you need to create the variable once outside the graph
            y = tf.reshape(self.func(self.xTensor), [-1, self.nX])
        jac_x_at_i = g.batch_jacobian(y, self.xTensor)
        # del g
        return jac_x_at_i

    def __del__(self):
        if self.sess is not None:
            self.sess.close()


def main():
    @tf.function
    def Xdot(x_i):
        x_0, x_1, x_2 = tf.split(x_i, 3, axis=1)
        return tf.concat([x_2 * tf.sin(x_2), x_2 * tf.cos(x_2), x_2], axis=1)

    nT = 20
    nX = 3

    # create some trash data
    x_i = np.random.random([nT, nX])

    nTrials = 1000  # I find the "eager" (GradientTape) version is slower for nTrials <= 1E3, faster for >= 1E4, and TWICE as fast for >= 1E5

    # try the eager version first
    caller_eager = FunctionCaller(Xdot, nT, nX, useSessions=False)
    start_time = timeit.default_timer()
    for _ in range(nTrials):
        jac_eager = caller_eager.jac(x_i)
    elapsed = timeit.default_timer() - start_time
    print("eager code took {} sec: {} sec/trial".format(elapsed, elapsed/nTrials))

    # now try the sessions version
    caller_sessions = FunctionCaller(Xdot, nT, nX, useSessions=True)
    start_time = timeit.default_timer()
    for _ in range(nTrials):
        jac_session = caller_sessions.jac(x_i)
    elapsed = timeit.default_timer() - start_time
    print("session code took {} sec: {} sec/trial".format(elapsed, elapsed/nTrials))

    residual = np.max(np.abs(jac_eager - jac_session))
    print('residual between eager and session trials is {}'.format(residual))

if __name__ == "__main__":
    main()

  • To have any kind of fair comparison, you should wrap the `GradientTape` code in a `tf.function`. Eager execution will be slower than graph execution. – xdurch0 May 15 '20 at 08:14
  • @xdurch0, thank you so much! Wrapping the _useGradientTape() function in tf.function did the trick (once I moved xTensor's definition outside). GradientTape now runs at a comparable speed to tf.gradients() **or even faster** if I run more trials. I'm new here. Would it be proper etiquette to edit my original post with updated code? If you rephrase your comment as an answer, I'll mark it as such – keithrausch May 15 '20 at 17:52
  • The way I understand this, what you @keithrausch are here labelling 'eager code' (when your useSessions=False) is in fact not 'eager' but static graph. Eager would really be much slower than both of these. See e.g. https://www.tensorflow.org/guide/function and/or https://towardsdatascience.com/eager-execution-vs-graph-execution-which-is-better-38162ea4dbf6. – Morten Grum Feb 02 '21 at 07:17
  • yes, I believe you are correct and that I conflated the two (some TF 1.X habits die hard) – keithrausch Feb 03 '21 at 13:23
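For anyone else tripped up by the eager/graph terminology in these comments, the distinction is easy to see directly: Python-level side effects inside a tf.function only run while the graph is being traced, not on every call. A generic illustration, not tied to the code above:

import tensorflow as tf

@tf.function
def f(x):
    print("tracing")  # Python side effects run only while tf.function traces the graph
    return x * x

f(tf.constant(1.0))         # prints "tracing", then runs the traced graph
f(tf.constant(2.0))         # same signature: graph is reused, nothing printed
f(tf.constant([1.0, 2.0]))  # new input shape: retraces and prints again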
