I was reading Baydin et al., Automatic Differentiation in Machine Learning: a Survey, 2018 (arXiv), which distinguishes between symbolic differentiation and automatic differentiation (AD). It then says:
AD Is Not Symbolic Differentiation. Symbolic differentiation is the automatic manipulation of [symbolic] expressions.
AD can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives.
Evaluation traces form the basis of the AD techniques. [A computational graph (Bauer, 1974) visualizes dependency relations of (input, working, output) variables in evaluation traces.]
It then goes on to describe how derivatives are computed with AD (in forward or reverse mode). The description essentially amounts to transforming the evaluation trace / computational graph.
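For concreteness, this is how I picture forward-mode AD: the program runs as usual, but every intermediate value carries a tangent (derivative) along with it. A minimal dual-number sketch of my own (not code from the survey), using the survey's ln(x1) + x1*x2 - sin(x2) example:

```python
import math

# Minimal forward-mode AD via dual numbers (my own sketch, not code from the survey).
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # primal value
        self.dot = dot   # tangent carried alongside the primal

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def log(x):
    return Dual(math.log(x.val), x.dot / x.val)

def sin(x):
    return Dual(math.sin(x.val), x.dot * math.cos(x.val))

def f(x1, x2):
    return log(x1) + x1 * x2 - sin(x2)

# df/dx1 at (2, 5): seed x1 with tangent 1, x2 with tangent 0.
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
print(y.val, y.dot)   # f(2, 5) and df/dx1 = 1/x1 + x2 = 5.5
```

So each forward-mode run gives one directional derivative, determined by how the input tangents are seeded; reverse mode instead propagates adjoints backwards over the recorded trace.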
Autograd, Chainer, and PyTorch provide general-purpose reverse mode AD.
It also discusses Theano, TensorFlow, and others, but it basically compares define-and-run / static computational graphs (Theano, TF) vs define-by-run / dynamic computational graphs (PyTorch, TF Eager). (In my understanding, this distinction is orthogonal to the question of how AD is performed; it mostly changes how AD is implemented, not the concept of AD itself.)
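For example, in PyTorch the graph/tape is recorded while the Python code executes (define-by-run) and then traversed backwards. A minimal sketch (my own example, same function as above):

```python
import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

y = torch.log(x1) + x1 * x2 - torch.sin(x2)  # forward pass records the tape
y.backward()                                 # reverse accumulation over the tape

print(x1.grad)  # dy/dx1 = 1/x1 + x2 = 5.5
print(x2.grad)  # dy/dx2 = x1 - cos(x2)
```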
Theano is a computational graph optimizer and compiler [...] and it currently handles derivatives in a highly optimized form of symbolic differentiation. The result can be interpreted as a hybrid of symbolic differentiation and reverse mode AD, but Theano does not use the general-purpose reverse accumulation as we describe in this paper. (Personal communication with the authors.)
I'm not sure whether the authors mean to imply that Theano/TF do not provide general-purpose reverse mode AD (which would be wrong in my understanding).
I also don't exactly understand in what sense Theano does not use general-purpose reverse accumulation.
Also, I don't understand how symbolic differentiation is different from AD, given this definition.
Or: How are symbolic expressions different from computational graphs?
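To make my confusion concrete, here is the contrast as I currently understand it (my own toy illustration, not code from the survey): symbolic differentiation returns another expression, whereas AD returns derivative values of whatever the program actually computed, including data-dependent control flow:

```python
import sympy
import torch

# Symbolic differentiation: the output is another symbolic *expression*.
s = sympy.Symbol('s')
print(sympy.diff(s**2 * sympy.sin(s), s))   # s**2*cos(s) + 2*s*sin(s)

# Reverse-mode AD: the output is a *number*, the derivative of whatever the
# program actually computed at this particular input, control flow included.
def f(x):
    y = x
    while y.abs() < 10.0:   # how many squarings happen depends on the input value
        y = y * y
    return y

x = torch.tensor(1.5, requires_grad=True)
f(x).backward()
print(x.grad)   # here f(1.5) happened to compute x**8, so the gradient is 8 * 1.5**7
```

But a computational graph also looks like an expression to me (nodes are operations, edges are dependencies), which is exactly the source of my question.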
Also related is differentiable programming:
differentiable directed graphs assembled from functional blocks
where I again do not see how this differs from a computational graph.
And backpropagation (BP):
The resulting algorithm is essentially equivalent to transforming the network evaluation function composed with the objective function under reverse mode AD, which, as we shall see, actually generalizes the backpropagation idea.
I don't see how reverse mode AD is more general than backpropagation. Is it? How?
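For instance, is the point that reverse-mode AD applies to arbitrary differentiable programs, whereas backpropagation is usually stated as layer-by-layer formulas for a fixed network architecture? A sketch of what I mean (my own example, differentiating through an iterative computation with no layers at all):

```python
import torch

# Newton's iteration for sqrt(a); reverse-mode AD differentiates right through
# the loop, even though there is no "network" and no layer-wise BP formula here.
a = torch.tensor(2.0, requires_grad=True)

x = torch.tensor(1.0)
for _ in range(20):
    x = 0.5 * (x + a / x)   # x converges to sqrt(a)

x.backward()
print(x.item(), a.grad.item())   # ~sqrt(2) and ~1/(2*sqrt(2)) ≈ 0.3536
```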
Schmidhuber, Deep Learning in Neural Networks: An Overview, 2014 (Section 5.5), also states:
BP is also known as the reverse mode of automatic differentiation (Griewank, 2012).