Going through this book, I came across the following description of backpropagation:
For each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error.
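To make sure I understand those steps, here is a minimal sketch of them for a tiny one-parameter network (the toy model, variable names, and learning rate are my own, not from the book):

```python
import numpy as np

# Toy model: y_hat = sigmoid(w * x + b), squared-error loss.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0           # one training instance
w, b, lr = 0.5, 0.0, 0.1  # weights and learning rate

# Forward pass: make a prediction and measure the error.
y_hat = sigmoid(w * x + b)
loss = 0.5 * (y_hat - y) ** 2

# Reverse pass: measure each connection's error contribution (chain rule).
dloss_dyhat = y_hat - y
dyhat_dz = y_hat * (1.0 - y_hat)
grad_w = dloss_dyhat * dyhat_dz * x    # dz/dw = x
grad_b = dloss_dyhat * dyhat_dz * 1.0  # dz/db = 1

# Finally, slightly tweak the weights to reduce the error.
w -= lr * grad_w
b -= lr * grad_b
```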
However, I am not sure how this differs from the reverse-mode autodiff implementation used by TensorFlow.
As far as I know, reverse-mode autodiff first goes through the graph in the forward direction and then, in a second pass, computes all partial derivatives of the outputs with respect to the inputs. This seems very similar to the backpropagation algorithm described above.
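For example, here is how I picture TensorFlow doing the same computation as my sketch above with tf.GradientTape (again, the variable names and toy values are my own):

```python
import tensorflow as tf

# Same toy model, but letting TensorFlow's reverse-mode autodiff
# compute the partial derivatives instead of deriving them by hand.
x, y = tf.constant(2.0), tf.constant(1.0)
w, b = tf.Variable(0.5), tf.Variable(0.0)

with tf.GradientTape() as tape:
    # Forward pass: TensorFlow records the operations it executes.
    y_hat = tf.sigmoid(w * x + b)
    loss = 0.5 * (y_hat - y) ** 2

# Second pass: walk the recorded graph in reverse to get
# d(loss)/d(w) and d(loss)/d(b).
grad_w, grad_b = tape.gradient(loss, [w, b])
```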
How does backpropagation differ from reverse-mode autodiff?