
I would like to intercept gradients that are backpropagated in my Tensorflow graph, which are not based on the loss (∂L/∂w), but based on some other node in the graph, for example the class scores (∂s/∂w) in a classification problem or some activation (∂a/∂w) to see how it changes when certain weights w change.

How can one implement this efficiently in Tensorflow? Intuitively, the gradients should already all be there for backprop of the loss as intermediate results, so there should be a solution without a big overhead.

I am already aware of the following suggestions, which don't exactly solve the problem:

  • The Tensorflow method tf.gradients(ys, xs), which computes the gradient of every y in ys w.r.t. every x in xs, but then, for each x in xs, sums over all y. Applying this function to every y in ys separately, however, induces a large computational overhead.

  • This stackoverflow post, which asks this question for the derivative of the loss w.r.t. some parameters, i.e. ∂L/∂w.

  • The part of the documentation that proposes calling optimizer.compute_gradients() as an easy-to-use wrapper around tf.gradients(). However, calling this function for every variable of interest again introduces a large computational overhead.
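To make the first point concrete, here is a minimal sketch of the summing behaviour of tf.gradients (TF1-style graph mode via tf.compat.v1; the toy shapes and values are made up for illustration):

```python
import tensorflow as tf

# TF1-style graph mode, so that tf.gradients can be used
tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[2])
y = tf.stack([x[0] * x[1], x[0] ** 2])  # two output components

# tf.gradients yields d(sum(y))/dx, i.e. the Jacobian rows summed up,
# NOT the individual gradients dy_i/dx
summed = tf.gradients(y, x)[0]

with tf.compat.v1.Session() as sess:
    print(sess.run(summed, feed_dict={x: [3.0, 2.0]}))  # [8. 3.]
```

At x = [3, 2] the individual gradients are dy₀/dx = [2, 3] and dy₁/dx = [6, 0]; tf.gradients returns only their sum [8, 3].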

Update: Phrased differently, what I want is the Jacobian of any component of the computational graph w.r.t. any other. This topic has been touched on in this recent Tensorflow issue, but is described there as currently not being efficiently/conveniently implemented.
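For reference, the naive baseline computes the Jacobian with exactly the per-component loop whose overhead is the concern here; a minimal sketch (graph mode via tf.compat.v1, toy shapes chosen for illustration):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[2])
y = tf.stack([x[0] * x[1], x[0] ** 2])

# Naive Jacobian: one tf.gradients call per output component.
# Each call builds its own backprop subgraph, which is the
# computational overhead discussed above.
rows = [tf.gradients(y[i], x)[0] for i in range(2)]
jacobian = tf.stack(rows)  # shape [len(y), len(x)]

with tf.compat.v1.Session() as sess:
    print(sess.run(jacobian, feed_dict={x: [3.0, 2.0]}))
```

At x = [3, 2] this yields the full Jacobian [[2, 3], [6, 0]], at the cost of one backprop pass per row.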

madison54
  • How big is the computational overhead for recipe 3? TF has common-subexpression elimination; if it works for your network, there shouldn't be too much overhead – Yaroslav Bulatov Oct 17 '16 at 20:30
  • BTW, the awkwardness of the API for getting at "intermediate results" is because TensorFlow is free to rewrite the graph to get you the result faster, so the intermediate results may actually not be getting computed – Yaroslav Bulatov Oct 17 '16 at 20:48
  • The awkward way may be to use `gradient_override_map` to give a custom gradient to the ops you want to capture. The custom gradient would return the original gradient (from the registry that `@RegisterGradient` modifies), and also make a copy of that tensor available (i.e., as a global variable that you can put into sess.run) – Yaroslav Bulatov Oct 17 '16 at 20:55
  • To your comment on recipe 3: I don't know how to vectorize these function calls, so I am stuck with a possibly pretty large loop around `optimizer.compute_gradients()`. Do you see a better way to implement it? – madison54 Oct 17 '16 at 21:43
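The override trick suggested in the comments above can be sketched roughly as follows, assuming the intended API is `tf.Graph.gradient_override_map`; the registered-gradient name, the capture list, and the toy graph are all illustrative:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

captured = []  # gradient tensors flowing into the capture point land here

@tf.RegisterGradient("CaptureGrad")
def _capture_grad(op, grad):
    captured.append(grad)  # keep a handle so it can be fetched in sess.run
    return grad            # Identity's gradient is a pass-through

graph = tf.compat.v1.get_default_graph()
x = tf.compat.v1.placeholder(tf.float32, shape=[])
with graph.gradient_override_map({"Identity": "CaptureGrad"}):
    h = tf.identity(x * x)  # insert a capture point after x*x
loss = 3.0 * h

dx = tf.gradients(loss, x)[0]  # building this populates `captured`

with tf.compat.v1.Session() as sess:
    print(sess.run([dx] + captured, feed_dict={x: 2.0}))
```

At x = 2 this fetches dx = 12 together with the intercepted upstream gradient d(loss)/dh = 3, without a second backprop pass.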

0 Answers