
In the code below:

  • `dy` is computed as 1. How is this value computed (what's the math)? As per the `tf.custom_gradient` guide, `dy` is the upstream gradient.

  • Why is the final gradient getting multiplied by the clip_norm value (0.6)? (That is, the final gradient of `v * v` is getting multiplied by 0.6; the gradient of `v * v` is `2v`, so why is it multiplied by 0.6?)

    import tensorflow as tf

    @tf.custom_gradient
    def clip_gradients(y):
      print('y', y)
      def backward(dy):
        print('dy', dy)
        return tf.clip_by_norm(dy, 0.6)
      return y, backward

    v = tf.Variable(3.0)

    with tf.GradientTape() as t:
      output = clip_gradients(v * v)
      print('output', output)

    print('Final Gradient is ', t.gradient(output, v))
    


Code output:

    y tf.Tensor(9.0, shape=(), dtype=float32)
    output tf.Tensor(9.0, shape=(), dtype=float32)
    dy tf.Tensor(1.0, shape=(), dtype=float32)
    Final Gradient is  tf.Tensor(3.6000001, shape=(), dtype=float32)
Mins
  • What do you mean by `Why final gradients is getting multiplied by clip_norm value(0.6)?` Also, maybe that [answer](https://stackoverflow.com/a/44342968/7370153) (although about TF1) could help you understand. – Lescurel Jul 26 '21 at 13:05
  • Edited the question, please look into it. It means the final gradient of `v * v` is getting multiplied by 0.6; the gradient of `v * v` is `2v`, so why is it multiplied by 0.6? – Mins Jul 26 '21 at 17:01

1 Answer


`dy` is initialized to 1.0 at the start of backpropagation because that is the derivative of the identity function. By the chain rule, `(f(g(x)))' = f'(g(x)) * g'(x)`. If `f` is the identity function (`f(x) = x`), then `f'(g(x)) = 1` and the expression reduces to `1 * g'(x)`.
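For intuition, here is a minimal sketch (the name `passthrough` is hypothetical, not from the question) whose backward function returns `dy` unchanged; the printed upstream gradient is 1.0 because the custom-gradient op is the last thing recorded on the tape:

    import tensorflow as tf

    @tf.custom_gradient
    def passthrough(y):
      # forward pass: identity; backward pass: hand the upstream gradient on untouched
      def backward(dy):
        print('upstream dy', dy)  # 1.0 -- the derivative of the identity
        return dy
      return y, backward

    v = tf.Variable(3.0)
    with tf.GradientTape() as t:
      output = passthrough(v * v)
    print(t.gradient(output, v))  # 1.0 * 2*v = 6.0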

Your function `clip_gradients` clips the incoming gradient so that its norm is at most 0.6; for a scalar, that means any magnitude above 0.6 is scaled down to 0.6. The initial value of `dy` is 1.0 (as explained above), so it gets clipped to 0.6.
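As a quick check of what the clipping does to a scalar (a small sketch using only `tf.clip_by_norm`, with values chosen for illustration):

    import tensorflow as tf

    print(tf.clip_by_norm(tf.constant(1.0), 0.6))   # 0.6  -> norm above 0.6 is scaled down to 0.6
    print(tf.clip_by_norm(tf.constant(0.4), 0.6))   # 0.4  -> already within the norm, left unchanged
    print(tf.clip_by_norm(tf.constant(-1.0), 0.6))  # -0.6 -> sign preserved, magnitude clipped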

If we apply the chain rule to your example, we have:

  • the derivative of the identity is 1.0, then clipped to 0.6.
  • the derivative of v*v is 2*v

By applying the chain rule, we get the final gradient to be 0.6*2*v, which is equal to 3.6 when v=3.
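A small sketch reproducing that arithmetic outside the tape (all values taken from the question):

    import tensorflow as tf

    v = tf.Variable(3.0)
    clipped_dy = tf.clip_by_norm(tf.constant(1.0), 0.6)  # upstream gradient after clipping: 0.6
    local_grad = 2.0 * v                                  # derivative of v*v at v=3: 6.0
    print(clipped_dy * local_grad)                        # 0.6 * 6.0 = 3.6 -- matches the tape's result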

Lescurel
  • 'dy is initialized to 1.0 at the start of backpropagation because that is the derivative of the identity function' Would this statement always hold, whatever function we give as the model argument? – Mins Jul 28 '21 at 02:46
  • Yes, because you can always apply `f(model)`, where `f` is the identity function, without changing your end result. – Lescurel Jul 28 '21 at 06:35
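A sketch illustrating that comment: wrapping the model output in the identity (here via `tf.identity`) leaves the gradient unchanged, which is why the backpropagation can always be started with `dy = 1.0`:

    import tensorflow as tf

    v = tf.Variable(3.0)

    with tf.GradientTape() as t1:
      plain = v * v                  # model output as-is
    with tf.GradientTape() as t2:
      wrapped = tf.identity(v * v)   # same output, wrapped in the identity

    print(t1.gradient(plain, v))    # 6.0
    print(t2.gradient(wrapped, v))  # 6.0 -- identical; the identity wrapper changes nothing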