
In the code below:

  • `dy` is computed as 1. How is this value computed (what's the math)? As per the `tf.custom_gradient` guide, `dy` is the upstream gradient.

  • Why is the final gradient getting multiplied by the clip_norm value (0.6)? (That is, the final gradient of `v * v` is getting multiplied by 0.6; the gradient of `v * v` is `2v`, so why is it multiplied by 0.6?)

    import tensorflow as tf

    @tf.custom_gradient
    def clip_gradients(y):
      print('y', y)
      def backward(dy):
        print('dy', dy)
        return tf.clip_by_norm(dy, 0.6)
      return y, backward

    v = tf.Variable(3.0)

    with tf.GradientTape() as t:
      output = clip_gradients(v * v)
      print('output', output)

    print('Final Gradient is ', t.gradient(output, v))
    


Code output:

    y tf.Tensor(9.0, shape=(), dtype=float32)
    output tf.Tensor(9.0, shape=(), dtype=float32)
    dy tf.Tensor(1.0, shape=(), dtype=float32)
    Final Gradient is  tf.Tensor(3.6000001, shape=(), dtype=float32)
Mins
  • What do you mean by `Why final gradients is getting multiplied by clip_norm value(0.6)?` Also, maybe that [answer](https://stackoverflow.com/a/44342968/7370153) (although about TF1) could help you understand. – Lescurel Jul 26 '21 at 13:05
  • Edited the question, please look into it. It means the final gradient of `v * v` is getting multiplied by 0.6; the gradient of `v * v` is `2v`, so why is it multiplied by 0.6? – Mins Jul 26 '21 at 17:01

1 Answer


`dy` is initialized to 1.0 at the start of backpropagation because that is the derivative of the identity function. By the chain rule, `(f(g(x)))' = f'(g(x)) * g'(x)`. If `f` is the identity function (`f(x) = x`), then `f'(g(x)) = 1` and the expression reduces to `1 * g'(x)`.
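For intuition, here is a minimal sketch (the name `passthrough` is hypothetical, not from the question) whose backward function returns `dy` unchanged; the printed upstream gradient is 1.0 because the custom-gradient op is the last thing recorded on the tape:

    import tensorflow as tf

    @tf.custom_gradient
    def passthrough(y):
      # forward pass: identity; backward pass: hand the upstream gradient on untouched
      def backward(dy):
        print('upstream dy', dy)  # 1.0 -- the derivative of the identity
        return dy
      return y, backward

    v = tf.Variable(3.0)
    with tf.GradientTape() as t:
      output = passthrough(v * v)
    print(t.gradient(output, v))  # 1.0 * 2*v = 6.0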

Your function `clip_gradients` clips the incoming gradient so that its norm is at most 0.6; for a scalar, that means any magnitude above 0.6 is scaled down to 0.6. The initial value of `dy` is 1.0 (as explained above), so it gets clipped to 0.6.
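As a quick check of what the clipping does to a scalar (a small sketch using only `tf.clip_by_norm`, with values chosen for illustration):

    import tensorflow as tf

    print(tf.clip_by_norm(tf.constant(1.0), 0.6))   # 0.6  -> norm above 0.6 is scaled down to 0.6
    print(tf.clip_by_norm(tf.constant(0.4), 0.6))   # 0.4  -> already within the norm, left unchanged
    print(tf.clip_by_norm(tf.constant(-1.0), 0.6))  # -0.6 -> sign preserved, magnitude clipped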

If we apply the chain rule to your example, we have:

  • the derivative of the identity is 1.0, then clipped to 0.6.
  • the derivative of v*v is 2*v

By applying the chain rule, we get the final gradient to be 0.6*2*v, which is equal to 3.6 when v=3.
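A small sketch reproducing that arithmetic outside the tape (all values taken from the question):

    import tensorflow as tf

    v = tf.Variable(3.0)
    clipped_dy = tf.clip_by_norm(tf.constant(1.0), 0.6)  # upstream gradient after clipping: 0.6
    local_grad = 2.0 * v                                  # derivative of v*v at v=3: 6.0
    print(clipped_dy * local_grad)                        # 0.6 * 6.0 = 3.6 -- matches the tape's result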

Lescurel
  • 'dy is initialized to 1.0 at the start of backpropagation because that is the derivative of the identity function' Would this statement always hold, whatever function we give as the model argument? – Mins Jul 28 '21 at 02:46
  • Yes, because you can always apply `f(model)`, where `f` is the identity function, without changing your end result. – Lescurel Jul 28 '21 at 06:35
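A sketch illustrating that comment: wrapping the model output in the identity (here via `tf.identity`) leaves the gradient unchanged, which is why the backpropagation can always be started with `dy = 1.0`:

    import tensorflow as tf

    v = tf.Variable(3.0)

    with tf.GradientTape() as t1:
      plain = v * v                  # model output as-is
    with tf.GradientTape() as t2:
      wrapped = tf.identity(v * v)   # same output, wrapped in the identity

    print(t1.gradient(plain, v))    # 6.0
    print(t2.gradient(wrapped, v))  # 6.0 -- identical; the identity wrapper changes nothing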