I am trying to understand how TensorFlow computes the gradients for tf.train.GradientDescentOptimizer.
If I understand section 4.1 of the TensorFlow whitepaper correctly, the gradients are computed via backpropagation by adding nodes to the TensorFlow graph that compute the derivative of a node in the original graph.
When TensorFlow needs to compute the gradient of a tensor C with respect to some tensor I on which C depends, it first finds the path in the computation graph from I to C. Then it backtracks from C to I, and for each operation on the backward path it adds a node to the TensorFlow graph, composing the partial gradients along the backwards path using the chain rule. The newly added node computes the “gradient function” for the corresponding operation in the forward path. A gradient function may be registered by any operation. This function takes as input not only the partial gradients computed already along the backward path, but also, optionally, the inputs and outputs of the forward operation. [Section 4.1 TensorFlow whitepaper]
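To make the question concrete, here is a minimal sketch of where I think those extra nodes appear (assuming TensorFlow 1.x graph mode; the tensor names are just for illustration). My understanding is that calling tf.gradients adds new operations to the graph, typically under a "gradients/" name scope:

```python
import tensorflow as tf

# Minimal sketch (TensorFlow 1.x graph mode assumed): build a tiny graph,
# request a gradient, and list the operations that were added for it.
graph = tf.Graph()
with graph.as_default():
    x = tf.constant(3.0, name="x")
    w = tf.Variable(2.0, name="w")
    c = tf.square(w * x, name="c")       # forward path: C depends on w

    ops_before = set(op.name for op in graph.get_operations())
    grads = tf.gradients(c, w)           # backward-path nodes get added here
    ops_after = set(op.name for op in graph.get_operations())

    # The difference should be exactly the gradient nodes TensorFlow inserted,
    # usually under a "gradients/..." name scope.
    for name in sorted(ops_after - ops_before):
        print(name)
```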
Question 1: Does each TensorFlow operation have a second, registered implementation (a gradient function) that computes the derivative of the original operation?
Question 2: Is there a way to visualize which gradient nodes get added to the graph (or any logs that show them)?
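For question 2, something along the lines of the following would already be helpful, i.e. dumping the graph (including whatever gradient nodes the optimizer added) so it can be browsed in TensorBoard. This is only a sketch assuming TensorFlow 1.x; "./logs" is just a placeholder directory:

```python
import tensorflow as tf

# Sketch of the kind of inspection I have in mind (TensorFlow 1.x assumed):
# write the graph, including the nodes added by minimize(), to an event file
# that TensorBoard can display.
graph = tf.Graph()
with graph.as_default():
    x = tf.constant(3.0, name="x")
    w = tf.Variable(2.0, name="w")
    loss = tf.square(w * x, name="loss")
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    writer = tf.summary.FileWriter("./logs", graph)
    writer.close()

# Then inspect with: tensorboard --logdir ./logs
```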