
I'm still working on my understanding of the PyTorch autograd system. One thing I'm struggling with is understanding why .clamp(min=0) and nn.functional.relu() seem to have different backward passes.

It's especially confusing as .clamp is used equivalently to relu in PyTorch tutorials, such as https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn.

I found this when analysing the gradients of a simple fully connected net with one hidden layer and a relu activation (linear in the output layer).

To my understanding, the output of the following code should be all zeros. I hope someone can show me what I am missing.

import torch
dtype = torch.float

x = torch.tensor([[3,2,1],
                  [1,0,2],
                  [4,1,2],
                  [0,0,1]], dtype=dtype)

y = torch.ones(4,4)

w1_a = torch.tensor([[1,2],
                     [0,1],
                     [4,0]], dtype=dtype, requires_grad=True)
w1_b = w1_a.clone().detach()  # independent copy so each version accumulates its own gradient
w1_b.requires_grad = True



w2_a = torch.tensor([[-1, 1],
                     [-2, 3]], dtype=dtype, requires_grad=True)
w2_b = w2_a.clone().detach()
w2_b.requires_grad = True


y_hat_a = torch.nn.functional.relu(x.mm(w1_a)).mm(w2_a)  # version A: relu activation
y_a = torch.ones_like(y_hat_a)
y_hat_b = x.mm(w1_b).clamp(min=0).mm(w2_b)  # version B: clamp activation
y_b = torch.ones_like(y_hat_b)

loss_a = (y_hat_a - y_a).pow(2).sum()
loss_b = (y_hat_b - y_b).pow(2).sum()

loss_a.backward()
loss_b.backward()

print(w1_a.grad - w1_b.grad)  # expected: all zeros
print(w2_a.grad - w2_b.grad)

# OUT:
# tensor([[  0.,   0.],
#         [  0.,   0.],
#         [  0., -38.]])
# tensor([[0., 0.],
#         [0., 0.]])
# 
DaFlooo
  • I haven't analyzed your code carefully yet, but unless there's a bug in PyTorch, the only potential difference I see is the gradient of relu and clamp at 0. The activation function is continuous but not differentiable at 0. From optimization theory this means we should choose a subgradient of the function at this point, since a gradient doesn't exist. Intuitively, a subgradient is a slope which is tangent to the function at this point. For relu/clamp the subgradients at 0 are all values in the interval [0, 1]. My guess is that relu and clamp choose different subgradients here. – jodag Mar 10 '20 at 15:34
  • @jodag You are right that `clamp` and `relu` produce different gradients at `0`. I checked the two versions with a scalar tensor `x = 0`: `(x.clamp(min=0) - 1.0).pow(2).backward()` versus `(relu(x) - 1.0).pow(2).backward()`. The resulting `x.grad` is 0 for the ReLU version but -2 for the clamp version. That means ReLU chooses `x == 0 --> grad = 0` while `clamp` chooses `x == 0 --> grad = 1`. – a_guest Mar 10 '20 at 19:20
  • @a_guest, good catch! Your comment should be an answer. – iGian Aug 13 '20 at 19:04
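
For reference, here is a runnable version of the scalar check described in the comment above; the reported gradient values are as observed on the PyTorch version used at the time of the question (early 2020):

import torch
from torch.nn.functional import relu

# scalar input sitting exactly at the kink of relu/clamp
x = torch.tensor(0.0, requires_grad=True)
(relu(x) - 1.0).pow(2).backward()
print(x.grad)  # 0: relu uses gradient 0 at x == 0

x = torch.tensor(0.0, requires_grad=True)
(x.clamp(min=0) - 1.0).pow(2).backward()
print(x.grad)  # -2: clamp uses gradient 1 at x == 0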

1 Answer


The reason is that relu and clamp produce different gradients at 0. For a scalar tensor x = 0:

  • (relu(x) - 1.0).pow(2).backward() gives x.grad == 0
  • (x.clamp(min=0) - 1.0).pow(2).backward() gives x.grad == -2

Since the chain rule gives d/dx (f(x) - 1)^2 = 2 * (f(x) - 1) * f'(x), which at x = 0 evaluates to -2 * f'(0), this indicates that:

  • relu chooses x == 0 --> grad = 0
  • clamp chooses x == 0 --> grad = 1
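
This is also consistent with the single nonzero entry in the question's output: with the data from the question, the hidden-layer pre-activation x.mm(w1) is exactly 0 in exactly one place, and only the weight feeding that unit from the one nonzero input of that sample (w1[2, 1]) can receive a different gradient. A small sketch to locate that entry:

import torch

x = torch.tensor([[3., 2., 1.],
                  [1., 0., 2.],
                  [4., 1., 2.],
                  [0., 0., 1.]])
w1 = torch.tensor([[1., 2.],
                   [0., 1.],
                   [4., 0.]])

pre = x.mm(w1)               # hidden-layer pre-activations
print((pre == 0).nonzero())  # tensor([[3, 1]]): sample 4, hidden unit 2 sits exactly at 0

Everywhere the pre-activation is strictly positive or strictly negative, relu and clamp backpropagate identical gradients, so every other entry of the printed difference tensors is zero. The forward activations themselves are identical in both versions (relu(0) == clamp(0, min=0) == 0), which is why the w2 gradients match exactly.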

a_guest