The above solution is not completely correct: it only works in the special case where the output dimension is 1.
As mentioned in the docs, the output of torch.autograd.grad is related to the derivatives, but it is not actually dy/dx. For example, assume you have a neural network that takes a tensor of shape (batch_size, input_dim) as input and produces a tensor of shape (batch_size, output_dim) as output. The derivatives of the output w.r.t. the input should have shape (batch_size, output_dim, input_dim), but what you get from torch.autograd.grad has shape (batch_size, input_dim), which is the sum of the true derivatives over the output dimension. If you want the full derivatives, use torch.autograd.functional.jacobian as follows:
>>> import torch
>>> torch.__version__
'1.10.1+cu111'
#!/usr/bin/env python
# coding: utf-8
import torch
from torch import nn
import numpy as np
batch_size = 10
hidden_dim = 20
input_dim = 3
output_dim = 2
model = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, output_dim)).double()
x = torch.rand(batch_size, input_dim, requires_grad=True, dtype=torch.float64) #(batch_size, input_dim)
y = model(x) #y: (batch_size, output_dim)
#using torch.autograd.grad
dydx1 = torch.autograd.grad(y, x, retain_graph=True, grad_outputs=torch.ones_like(y))[0] #dydx1: (batch_size, input_dim)
print(f' using grad dydx1: {dydx1.shape}')
#using torch.autograd.functional.jacobian
j = torch.autograd.functional.jacobian(lambda t: model(t), x) #j: (batch_size, output_dim, batch_size, input_dim)
#the off-diagonal elements of 0th and 2nd dimension are all zero. So we remove them
dydx2 = torch.diagonal(j, offset=0, dim1=0, dim2=2) #dydx2: (output_dim, input_dim, batch_size)
dydx2 = dydx2.permute(2, 0, 1) #dydx2: (batch_size, output_dim, input_dim)
print(f' using jacobian dydx2: {dydx2.shape}')
#round to 14 decimal digits to avoid noise
print(np.round((dydx2.sum(dim=1)).numpy(), 14) == np.round(dydx1.numpy(), 14))
Output:
using grad dydx1: torch.Size([10, 3])
using jacobian dydx2: torch.Size([10, 2, 3])
#dydx2.sum(dim=1) == dydx1
[[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]
[ True True True]]
In fact, autograd.grad returns the sum of dy/dx over the output dimension.
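To state that more precisely: grad_outputs plays the role of the vector v in a vector-Jacobian product (this is the standard reverse-mode identity from the autograd docs; the batch indexing below is just my bookkeeping for the shapes used here):

\mathrm{dydx1}_{b,j} \;=\; \sum_{i=1}^{\mathrm{output\_dim}} v_{b,i}\,\frac{\partial y_{b,i}}{\partial x_{b,j}}, \qquad v = \mathrm{grad\_outputs}

With v = torch.ones_like(y) every v_{b,i} equals 1, so each entry of dydx1 is a row sum of the per-sample Jacobian, which is exactly why dydx2.sum(dim=1) matched dydx1 above.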
If you really want to stay with torch.autograd.grad, there is a (less efficient) way to get the full derivatives: call it once per output component with a one-hot grad_outputs:
dydx3 = torch.tensor([], dtype=torch.float64)
for i in range(output_dim):
    #one-hot grad_outputs selecting the i-th output component
    l = torch.zeros_like(y)
    l[:, i] = 1.
    d = torch.autograd.grad(y, x, retain_graph=True, grad_outputs=l)[0] #d: (batch_size, input_dim)
    dydx3 = torch.concat((dydx3, d.unsqueeze(dim=1)), dim=1) #stack along a new output dimension
print(f' dydx3: {dydx3.shape}')
print(np.round(dydx3.numpy(), 14) == np.round(dydx2.numpy(), 14))
Output:
dydx3: torch.Size([10, 2, 3])
[[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]
[[ True True True]
[ True True True]]]
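Side note (an alternative I have not benchmarked here, so treat it as a sketch): if your PyTorch version has the experimental vectorize flag of torch.autograd.functional.jacobian (it is present in 1.10), you can let autograd batch those per-output backward passes instead of looping in Python:
#sketch: assumes the experimental vectorize=True flag of jacobian is available
j_vec = torch.autograd.functional.jacobian(model, x, vectorize=True) #j_vec: (batch_size, output_dim, batch_size, input_dim)
dydx4 = torch.diagonal(j_vec, dim1=0, dim2=2).permute(2, 0, 1) #dydx4: (batch_size, output_dim, input_dim)
This should match dydx2 and dydx3 up to floating-point noise.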
I hope it helps.
P.S. I used retain_graph=True because the same graph is backpropagated through multiple times (once per autograd.grad call).