
I am having trouble understanding the conceptual meaning of the grad_outputs option in torch.autograd.grad.

The documentation says:

grad_outputs should be a sequence of length matching output containing the “vector” in Jacobian-vector product, usually the pre-computed gradients w.r.t. each of the outputs. If an output doesn’t require_grad, then the gradient can be None.

I find this description quite cryptic. What exactly do they mean by Jacobian-vector product? I know what the Jacobian is, but I'm not sure what product they mean here: element-wise, matrix product, something else? I can't tell from my example below.

And why is "vector" in quotes? Indeed, in the example below I get an error when grad_outputs is a vector, but not when it is a matrix.

>>> x = torch.tensor([1.,2.,3.,4.], requires_grad=True)
>>> y = torch.outer(x, x)

Why do we observe the following output; how was it computed?

>>> y
tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.],
        [ 4.,  8., 12., 16.]], grad_fn=<MulBackward0>)

>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
(tensor([20., 20., 20., 20.]),)

However, why this error?

>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(x))  

RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4]) and output[0] has a shape of torch.Size([4, 4]).


1 Answer


If we take your example, we have a function f which takes as input x, shaped (n,), and outputs y = f(x), shaped (n, n). The input is the column vector [x_i]_i for i ∈ [1, n], and f(x) is defined as the matrix [y_jk]_jk = [x_j*x_k]_jk for j, k ∈ [1, n]².

It is often useful to compute the gradient of the output with respect to the input (or sometimes w.r.t. the parameters of f; there are none here). In the more general case though, we are looking to compute dL/dx and not just dy/dx, where L is some scalar quantity computed downstream from y, and dL/dx is its gradient w.r.t. x.

The computation graph looks like:

x.grad = dL/dx <-------   dL/dy = y.grad
                dy/dx
       x       ------->    y = x*xT

Then, looking at dL/dx, which via the chain rule equals dL/dy*dy/dx, and comparing with the interface of torch.autograd.grad, we have the following correspondences (a small sanity check follows the list):

  • outputs <-> y,
  • inputs <-> x, and
  • grad_outputs <-> dL/dy.
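
To make the correspondence concrete, here is a small sanity check. The scalar L = (w * y).sum(), with an arbitrary weight tensor w, is a hypothetical downstream loss chosen purely for illustration; any scalar computed from y would do:

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

w = torch.rand_like(y)   # plays the role of dL/dy, since dL/dy_jk = w_jk
L = (w * y).sum()        # hypothetical scalar loss computed from y

# Letting autograd differentiate L end to end...
dLdx_full, = torch.autograd.grad(L, x, retain_graph=True)
# ...matches passing dL/dy ourselves through grad_outputs:
dLdx_vjp, = torch.autograd.grad(y, x, grad_outputs=w)

print(torch.allclose(dLdx_full, dLdx_vjp))  # True

In particular, grad_outputs=torch.ones_like(y), as in your example, corresponds to choosing L = y.sum().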

Looking at the shapes: dL/dx should have the same shape as x (dL/dx can be referred to as the 'gradient' of x), while dy/dx, the Jacobian, is 3-dimensional here since y is a matrix rather than a vector. On the other hand dL/dy, which is the incoming gradient, should have the same shape as the output, i.e., y's shape.

We want to compute dL/dx = dL/dy*dy/dx. If we look more closely, we have

dy/dx = [dy_jk/dx_i]_ijk for i, j, k ∈ [1, n]³

Therefore,

dL/dx = [dL/dx_i]_i, i ∈ [1, n]
      = [sum(dL/dy_jk * dy_jk/dx_i over j, k ∈ [1, n]²)]_i, i ∈ [1, n]

Back to your example: since grad_outputs is all ones, dL/dy_jk = 1, so for a given i ∈ [1, n]: dL/dx_i = sum(dy_jk/dx_i over j, k ∈ [1, n]²). And dy_jk/dx_i = d(x_j*x_k)/dx_i equals x_j if i = k, x_k if i = j, 2*x_i if i = j = k (because of the squared x_i), and 0 otherwise. Since the matrix y is symmetric, the column k = i contributes the same as the row j = i, namely sum(x) each (with the diagonal entry's 2*x_i split between the two). So the result comes down to 2*sum(x), regardless of i.

This means dL/dx is the column vector [2*sum(x)]_i for i ∈ [1, n].

>>> 2*x.sum()*torch.ones_like(x)
tensor([20., 20., 20., 20.])
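
You can also check this against the full Jacobian, materialized here purely for illustration with torch.autograd.functional.jacobian (autograd never builds this 3-dimensional tensor; it computes the contraction directly):

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

# Full Jacobian dy/dx, shaped (4, 4, 4), with J[j, k, i] = dy_jk/dx_i
J = torch.autograd.functional.jacobian(lambda t: torch.outer(t, t), x)

v = torch.ones_like(y)                  # dL/dy, i.e. what we pass as grad_outputs
dLdx = torch.einsum('jk,jki->i', v, J)  # sum dL/dy_jk * dy_jk/dx_i over j, k

print(dLdx)                                          # tensor([20., 20., 20., 20.])
print(torch.autograd.grad(y, x, grad_outputs=v)[0])  # tensor([20., 20., 20., 20.])

This contraction over j, k is the 'product' in Jacobian-vector product, with the 'vector' dL/dy shaped like y rather than like x; that is why grad_outputs must match y's shape and why your call with torch.ones_like(x) raises the shape-mismatch error.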

Stepping back, look at this other graph example, where an additional operation is added after y:

  x   ------->  y = x*xT  -------->  z = y²

If you look at the backward pass on this graph, you have:

dL/dx <-------   dL/dy    <--------  dL/dz
        dy/dx              dz/dy 
  x   ------->  y = x*xT  -------->  z = y²

With dL/dx = dL/dy*dy/dx = dL/dz*dz/dy*dy/dx, which is in practice computed in two sequential steps: dL/dy = dL/dz*dz/dy, then dL/dx = dL/dy*dy/dx.
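
Here is a minimal sketch of those two steps, assuming a hypothetical final scalar L = z.sum() (so that dL/dz is all ones):

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)
z = y ** 2

dLdz = torch.ones_like(z)  # dL/dz for the hypothetical L = z.sum()

# Step 1: dL/dy = dL/dz * dz/dy
dLdy, = torch.autograd.grad(z, y, grad_outputs=dLdz, retain_graph=True)
# Step 2: dL/dx = dL/dy * dy/dx
dLdx, = torch.autograd.grad(y, x, grad_outputs=dLdy, retain_graph=True)

# Same result as back-propagating through the whole graph in one call
dLdx_direct, = torch.autograd.grad(z, x, grad_outputs=dLdz)
print(torch.allclose(dLdx, dLdx_direct))  # True

Each torch.autograd.grad call performs one such vector-Jacobian step, with grad_outputs carrying the gradient arriving from the stage to its right in the graph.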

  • Thank you! A few clarifications. You say "we want to compute dL/dx = dL/dy*dy/dx". Does that mean `grad` returns dL/dx? I am not seeing that because dL/dy = `ones(4,4)` and dy/dx = `[20,20,20,20]` per your calculation, but dL/dx = `ones(4,4)*[20,20,20,20]` does not result in the returned value `[20,20,20,20]`. What am I missing? – user118967 Aug 15 '21 at 21:27
  • Also, I would expect dy/dx to be a n^3 matrix, where dy/dx_{j,k,i} = d x_j*x_k/dx_i (by generalizing the idea of a Jacobian where the function is usually vector-shaped while here it is matrix-shaped). I'm particularly unclear about the fact that you used summations. Were you computing the Jacobian determinant? Can you please tell the name of the mathematical concept you used for that computation? – user118967 Aug 15 '21 at 21:31
  • Thanks for your interest. In this case yes `dL/dx` would be `x.grad`. Here when computing `dL/dx`, I actually brushed that off in saying `dL/dx = dL/dy*dy/dx`, indeed it is not a simple multiplication as you might expect. Your reasoning is accurate: the Jacobian is a 3-dimensional object where `dy/dx_{j,k,i} = d x_j*x_k/dx_i` (as you rightly said). Since `dL/dy` is shaped like `y` (*2-dimensional*), the product `dL/dy*dy/dx` ends up being a 3D x 2D operation. If you've ever used *einsum*, this would look like `torch.einsum('jk,jki->i', dL/dy, dy/dx)`... – Ivan Aug 15 '21 at 22:07
  • ... *i.e.* for each 'matrix layer' of `dy/dx` (which you've indexed by `i` - corresponding to all the `dy/dx_i` components), you point-wise multiply with `dL/dy` and reduce to a single value. The resulting object is 1-dimensional: the vector `[dL/dx]_i = sum(d(y_jk)/dx_i * dL/dy_jk over j,k in [1, n]²) for i in [1, n]`. I have updated the answer above, correcting some elements and imprecisions and adding more details on the Jacobian matrix (the diff can be found [here](https://stackoverflow.com/posts/68781125/revisions)). Don't hesitate if you have any issues or points that need clarification. – Ivan Aug 15 '21 at 22:33
  • Thank you, this second version is much better! It's unfortunate that I chose `y` to be a matrix function as it made things much more complicated than they needed to be and to go much farther than a simple explanation of `autograd.grad`, but you managed to contain that complication quite well while providing the meaning of the parameter of the `autograd.grad`, which was the essence of the question. – user118967 Aug 23 '21 at 23:10