
I'm still working on my understanding of the PyTorch autograd system. One thing I'm struggling with is understanding why .clamp(min=0) and nn.functional.relu() seem to have different backward passes.

It's especially confusing as .clamp is used equivalently to relu in PyTorch tutorials, such as https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn.

I found this when analysing the gradients of a simple fully connected net with one hidden layer and a relu activation (linear in the output layer).

To my understanding, the output of the following code should be all zeros. I hope someone can show me what I am missing.

import torch
dtype = torch.float

x = torch.tensor([[3,2,1],
                  [1,0,2],
                  [4,1,2],
                  [0,0,1]], dtype=dtype)

y = torch.ones(4,4)

w1_a = torch.tensor([[1,2],
                     [0,1],
                     [4,0]], dtype=dtype, requires_grad=True)
w1_b = w1_a.clone().detach()  # independent copy so each version accumulates its own gradient
w1_b.requires_grad = True



w2_a = torch.tensor([[-1, 1],
                     [-2, 3]], dtype=dtype, requires_grad=True)
w2_b = w2_a.clone().detach()
w2_b.requires_grad = True


y_hat_a = torch.nn.functional.relu(x.mm(w1_a)).mm(w2_a)  # version A: relu activation
y_a = torch.ones_like(y_hat_a)
y_hat_b = x.mm(w1_b).clamp(min=0).mm(w2_b)  # version B: clamp activation
y_b = torch.ones_like(y_hat_b)

loss_a = (y_hat_a - y_a).pow(2).sum()
loss_b = (y_hat_b - y_b).pow(2).sum()

loss_a.backward()
loss_b.backward()

print(w1_a.grad - w1_b.grad)  # expected: all zeros
print(w2_a.grad - w2_b.grad)

# OUT:
# tensor([[  0.,   0.],
#         [  0.,   0.],
#         [  0., -38.]])
# tensor([[0., 0.],
#         [0., 0.]])
# 
DaFlooo
  • I haven't analyzed your code carefully yet, but unless there's a bug in PyTorch, the only potential difference I see is the gradient of relu and clamp at 0. The activation function is continuous but not differentiable at 0. From optimization theory this means we should choose a subgradient of the function at this point, since a gradient doesn't exist. Intuitively, a subgradient is a slope which is tangent to the function at this point. For relu/clamp the subgradients at 0 are all values in the interval [0, 1]. My guess is that relu and clamp choose different subgradients here. – jodag Mar 10 '20 at 15:34
  • @jodag You are right that `clamp` and `relu` produce different gradients at `0`. I checked the two versions with a scalar tensor `x = 0`: `(x.clamp(min=0) - 1.0).pow(2).backward()` versus `(relu(x) - 1.0).pow(2).backward()`. The resulting `x.grad` is 0 for the ReLU version but -2 for the clamp version. That means ReLU chooses `x == 0 --> grad = 0` while `clamp` chooses `x == 0 --> grad = 1`. – a_guest Mar 10 '20 at 19:20
  • @a_guest, good catch! Your comment should be an answer. – iGian Aug 13 '20 at 19:04
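
For reference, here is a runnable version of the scalar check described in the comment above; the reported gradient values are as observed on the PyTorch version used at the time of the question (early 2020):

import torch
from torch.nn.functional import relu

# scalar input sitting exactly at the kink of relu/clamp
x = torch.tensor(0.0, requires_grad=True)
(relu(x) - 1.0).pow(2).backward()
print(x.grad)  # 0: relu uses gradient 0 at x == 0

x = torch.tensor(0.0, requires_grad=True)
(x.clamp(min=0) - 1.0).pow(2).backward()
print(x.grad)  # -2: clamp uses gradient 1 at x == 0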

1 Answer


The reason is that relu and clamp produce different gradients at 0. For a scalar tensor x = 0:

  • (relu(x) - 1.0).pow(2).backward() gives x.grad == 0
  • (x.clamp(min=0) - 1.0).pow(2).backward() gives x.grad == -2

Since the chain rule gives d/dx (f(x) - 1)^2 = 2 * (f(x) - 1) * f'(x), which at x = 0 evaluates to -2 * f'(0), this indicates that:

  • relu chooses x == 0 --> grad = 0
  • clamp chooses x == 0 --> grad = 1
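
This is also consistent with the single nonzero entry in the question's output: with the data from the question, the hidden-layer pre-activation x.mm(w1) is exactly 0 in exactly one place, and only the weight feeding that unit from the one nonzero input of that sample (w1[2, 1]) can receive a different gradient. A small sketch to locate that entry:

import torch

x = torch.tensor([[3., 2., 1.],
                  [1., 0., 2.],
                  [4., 1., 2.],
                  [0., 0., 1.]])
w1 = torch.tensor([[1., 2.],
                   [0., 1.],
                   [4., 0.]])

pre = x.mm(w1)               # hidden-layer pre-activations
print((pre == 0).nonzero())  # tensor([[3, 1]]): sample 4, hidden unit 2 sits exactly at 0

Everywhere the pre-activation is strictly positive or strictly negative, relu and clamp backpropagate identical gradients, so every other entry of the printed difference tensors is zero. The forward activations themselves are identical in both versions (relu(0) == clamp(0, min=0) == 0), which is why the w2 gradients match exactly.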

a_guest