
I was reading Baydin et al., Automatic Differentiation in Machine Learning: a Survey, 2018 (arXiv), which distinguishes between symbolic differentiation and automatic differentiation (AD). It then says:

AD Is Not Symbolic Differentiation. Symbolic differentiation is the automatic manipulation of [symbolic] expressions.

AD can be thought of as performing a non-standard interpretation of a computer program where this interpretation involves augmenting the standard computation with the calculation of various derivatives.

Evaluation traces form the basis of the AD techniques. [A computational graph (Bauer, 1974) visualizes dependency relations of (input, working, output) variables in evaluation traces.]

It then goes on by describing how to compute the derivative with AD (in forward or backward mode). The description is basically transforming the evaluation trace / computational graph.

Autograd, Chainer, and PyTorch provide general-purpose reverse mode AD.

It also discusses Theano, TensorFlow, and others, but it basically compares define-and-run / static computational graphs (Theano, TF) vs define-by-run / dynamic computational graphs (PyTorch, TF Eager). (In my understanding, this is orthogonal to the question of how AD is performed, or would mostly just change how AD is implemented, but not so much the concept of AD.)

Theano is a computational graph optimizer and compiler [...] and it currently handles derivatives in a highly optimized form of symbolic differentiation. The result can be interpreted as a hybrid of symbolic differentiation and reverse mode AD, but Theano does not use the general-purpose reverse accumulation as we describe in this paper. (Personal communication with the authors.)

I'm not sure if the authors imply that Theano/TF do not provide general-purpose reverse mode AD (which would be wrong in my understanding).

I don't exactly understand how Theano does not use the general-purpose reverse accumulation.

Also, I don't understand how symbolic differentiation is different from AD, given this definition.

Or: How are symbolic expressions different from computational graphs?

Also related is differentiable programming:

differentiable directed graphs assembled from functional blocks

where I again do not see the difference from a computational graph.

And backpropagation (BP):

The resulting algorithm is essentially equivalent to transforming the network evaluation function composed with the objective function under reverse mode AD, which, as we shall see, actually generalizes the backpropagation idea.

I don't see how reverse mode AD is more general than backpropagation. Is it? How?

Schmidhuber, Deep Learning in Neural Networks: An Overview, 2014 (section 5.5) also states:

BP is also known as the reverse mode of automatic differentiation (Griewank, 2012).

Albert

1 Answer


This is a nice question, which gets at some fundamental differences in AD and also some fundamental design differences between big ML libraries like PyTorch and TensorFlow. In particular, the difference between define-by-run and define-and-run AD is confusing and takes some time to appreciate.

Backpropagation versus Reverse-Mode AD?

You can see a Stack Overflow question here, and my answer to it. Basically, the difference is whether you want the gradient of a scalar-valued function R^n -> R or the vector-Jacobian product of a vector-valued function R^n -> R^m. Backpropagation assumes you want the gradient of a scalar loss function, and is a term most commonly used in the machine learning community to talk about neural network training.

Reverse-mode AD is therefore more general than backpropagation.
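
To make this concrete, here is a minimal sketch (using PyTorch purely for illustration, not because the paper prescribes it): backpropagation is the special case where the output is a scalar loss, while the general reverse-mode primitive is a vector-Jacobian product that works for vector-valued outputs as well.

```python
import torch

# Backpropagation: gradient of a scalar-valued loss R^n -> R.
x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()          # scalar output
loss.backward()                # classic backprop
print(x.grad)                  # the gradient, shape (3,)

# Reverse-mode AD in general: vector-Jacobian product of R^n -> R^m.
x = torch.randn(3, requires_grad=True)
y = torch.stack([x.sum(), (x ** 2).sum()])   # vector output, m = 2
v = torch.tensor([1.0, 0.5])                 # the "vector" in the VJP
vjp, = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp)                     # v^T J, shape (3,)
```

With m = 1 and v = 1, the second computation reduces exactly to the first, which is the sense in which backpropagation is the special case.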

How is symbolic differentiation different from AD?

Symbolic differentiation acts on symbols which represent inputs, while AD computes a numerical value of the derivative for a given input.

For example: suppose I have the function y = x^2. If I were to differentiate y symbolically, I would get the expression 2x. Now, for any value of x, I immediately know the value of the derivative at that x. But if I were to perform automatic differentiation, I would first set the value of x, say x = 5, and my AD tool would tell me the derivative is 2*5 = 10, but it wouldn't know anything about the derivative at x = 4, since it only computed the derivative at x = 5, not a symbolic expression for the derivative.
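
A minimal sketch of that contrast, assuming SymPy for the symbolic side and PyTorch for the AD side:

```python
import sympy
import torch

# Symbolic differentiation: returns an expression in the symbol x.
x_sym = sympy.symbols('x')
dy_sym = sympy.diff(x_sym ** 2, x_sym)
print(dy_sym)          # 2*x  -- valid for every x

# Automatic differentiation: returns a number for one particular input.
x = torch.tensor(5.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)          # tensor(10.)  -- the derivative at x = 5 only
```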

Difference between define-and-run / static computational graph and define-by-run / dynamic computational graph?

As you point out, TF1 and Theano are define-and-run, while PyTorch, Autograd, and TF2 are define-by-run. What is the difference?

In TensorFlow 1, you told TensorFlow what you were going to do, and TensorFlow prepared to perform those computations on some data by building the static computational graph; then, finally, you fed in the data and performed the calculations. So step 1 was telling TensorFlow what you were going to do, and step 2 was performing that calculation once TensorFlow got some data.
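
A rough sketch of what that two-step workflow looked like, assuming the TF1-style API (available as tf.compat.v1 in TF2):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Step 1: describe the computation; no data flows yet.
x = tf.placeholder(tf.float32)
y = x ** 2
dy = tf.gradients(y, x)

# Step 2: feed data through the already-built static graph.
with tf.Session() as sess:
    print(sess.run(dy, feed_dict={x: 5.0}))   # [10.0]
```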

In Autograd, you don't tell it what you are going to do before you do it. Autograd, unlike TF1, finds out what you are going to do to your data only after it receives the data. If it receives a vector, it has no idea what computations are going to be performed on the vector, because it has no static computational graph ahead of time. It "builds the graph" by recording operations on each variable as the code executes, and at the end of your computation you have a list of the operations that were performed, which you can traverse backwards. This makes it easy to include control flow like if statements. Handling control flow in a define-and-run framework is much more difficult.
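
A minimal sketch of the define-by-run style, assuming the Autograd library; the branching function f below is just an illustration:

```python
from autograd import grad

def f(x):
    if x > 0:                 # ordinary Python control flow
        return x ** 2
    else:
        return -x ** 3

df = grad(f)                  # the tape records only the branch actually taken
print(df(5.0))                # 10.0   (derivative of x^2 at x = 5)
print(df(-2.0))               # -12.0  (derivative of -x^3 at x = -2)
```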

Why do Theano and TF1 not provide general-purpose reverse-mode AD?

Theano and TF1 do not provide general-purpose AD because they don't allow for arbitrary control flow in the program being differentiated. Actually, TF1 did, but it was a mess.
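
To illustrate the difficulty: in a define-and-run graph, a plain Python if cannot branch on a symbolic tensor, so TF1 made you encode the branch into the graph itself with ops like tf.cond. A minimal sketch, again assuming the tf.compat.v1 API:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32)
# Both branches are baked into the static graph up front.
y = tf.cond(x > 0, lambda: x ** 2, lambda: -x ** 3)
dy = tf.gradients(y, x)

with tf.Session() as sess:
    print(sess.run(dy, feed_dict={x: 5.0}))    # [10.0]
    print(sess.run(dy, feed_dict={x: -2.0}))   # [-12.0]
```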

Difference between differentiable programming and computational graph?

From Wikipedia:

"Differentiable programming is a programming paradigm in which the programs can be differentiated throughout, usually via automatic differentiation."

So differentiable programming is a paradigm for designing programs. A computational graph, on the other hand, is an abstraction used in the AD field for understanding the computations performed by a differentiable computer program. One is a programming paradigm, one is a programming abstraction.

Nick McGreivy