
Why does:

with torch.no_grad():
    w = w - lr*w.grad
    print(w)

result in:

tensor(0.9871)

and

with torch.no_grad():
    w -= lr*w.grad
    print(w)

result in:

tensor(0.9871, requires_grad=True)

Aren't both operations the same?

Here is some test code:

import numpy as np
import torch

def test_stack():
    np.random.seed(0)
    n = 50
    feat1 = np.random.randn(n, 1)
    feat2 = np.random.randn(n, 1)

    X = torch.tensor(feat1).view(-1, 1)
    Y = torch.tensor(feat2).view(-1, 1)

    w = torch.tensor(1.0, requires_grad=True)

    epochs = 1
    lr = 0.001

    for epoch in range(epochs):
        for i in range(len(X)):
            y_pred = w*X[i]
            loss = (y_pred - Y[i])**2
            loss.backward()

            with torch.no_grad():
                #w = w - lr*w.grad  # DOESN'T WORK!!!!
                #print(w); return
                w -= lr*w.grad
                print(w); return

                w.grad.zero_()

Uncomment those lines and you'll see requires_grad disappearing. Could this be a bug?

Tony Power

1 Answer


I had the same issue and it boggled me. I asked ChatGPT, and it turns out that, inside torch.no_grad(), normal subtraction creates a new tensor with requires_grad set to False, while augmented assignment works in place on the existing tensor and therefore retains the requires_grad property.

Let's see with an example.

We will track the identity of the objects via the id() function, which returns an integer that is unique to an object for its lifetime.
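For instance, with plain Python lists (a quick illustration of id() itself, nothing PyTorch-specific):

a = [1, 2, 3]
b = a                  # b names the very same list object as a
c = [1, 2, 3]          # equal contents, but a separate object
print(id(a) == id(b))  # prints True
print(id(a) == id(c))  # prints False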

Normal subtraction

import torch
x = torch.tensor(5.0, requires_grad = True)
id1 = id(x) # the id for the tensor object referenced by x
y = torch.tensor(3.0)
with torch.no_grad(): # same setting as in the question
    x = x - y # out-of-place subtraction creates a new tensor
id2 = id(x) # the id for the new tensor object referenced by x
print(id1 == id2) # prints False
print(x.requires_grad) # prints False
  • The ids are different because the subtraction returns a brand-new tensor object, and since that tensor is created inside torch.no_grad(), it has requires_grad set to False.
  • Assigning the result back to the old name x has no effect on whether or not a new object gets created.
  • Reusing the name only means that we no longer hold a reference to the old tensor object, so it becomes eligible for garbage collection, as the sketch below shows.
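Here is a minimal sketch of that last point (the extra name old is mine, added purely for illustration): keeping a second reference to the original tensor shows that it is left untouched, while x now names a different, non-tracking tensor.

import torch
x = torch.tensor(5.0, requires_grad=True)
old = x # keep a second reference so the original object is not garbage collected
with torch.no_grad():
    x = x - torch.tensor(3.0)
print(old is x) # prints False -> two distinct objects
print(old.requires_grad) # prints True -> the original leaf tensor is unchanged
print(x.requires_grad) # prints False -> x now names a new tensor created under no_grad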

Now let's see augmented assignment

import torch
x = torch.tensor(5.0, requires_grad = True)
id1 = id(x) # the id for the tensor object referenced by x
y = torch.tensor(3.0)
with torch.no_grad(): # without this, an in-place op on a leaf that requires grad raises a RuntimeError
    x -= y # in-place subtraction modifies the existing tensor
id2 = id(x)
print(id1 == id2) # prints True
print(x.requires_grad) # prints True

Now, with augmented assignment, the subtraction happens in place: the existing tensor is modified and no new object is created. That is why the ids before and after the subtraction are the same; x still references the same object, which keeps requires_grad=True.
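For tensors, x -= y dispatches to PyTorch's in-place subtraction, which as far as I know is the same operation exposed as Tensor.sub_(). Here is a small sketch of the question's update written that way (the gradient value and learning rate are made up for illustration):

import torch
w = torch.tensor(1.0, requires_grad=True)
grad = torch.tensor(2.0) # stand-in for w.grad, just for illustration
lr = 0.001
with torch.no_grad():
    w.sub_(lr * grad) # in-place update: same object, same effect as w -= lr * grad
print(w) # prints tensor(0.9980, requires_grad=True)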

But wait, why do subtraction and augmented assignment work differently?

This is because they are implemented with different dunder methods: a - b calls __sub__, while a -= b calls __isub__ when the type defines it. I think this forum thread explains it well. The gist is that Python treats these operators differently; augmented assignment is not just syntactic sugar for subtraction followed by assignment, which is why there is a discrepancy between the two seemingly identical operations.
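As a rough illustration of that mechanism (the Box class below is made up; it is not how torch.Tensor is implemented), a type can define __sub__ to return a new object while __isub__ mutates the existing one:

class Box:
    def __init__(self, value):
        self.value = value

    def __sub__(self, other): # used for: a = a - b
        return Box(self.value - other)

    def __isub__(self, other): # used for: a -= b
        self.value -= other
        return self # hand back the same object, mutated in place

a = Box(5)
before = id(a)
a = a - 2 # calls __sub__ -> a new Box object
print(id(a) == before) # prints False

b = Box(5)
before = id(b)
b -= 2 # calls __isub__ -> the same Box object
print(id(b) == before) # prints True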

  • Thanks for the reply. I wish I knew the reason behind that lack of uniformity; I'd expect both to do exactly the same thing. A bug, maybe? – Tony Power May 11 '23 at 15:24
  • @TonyPower Maybe. I don't know how it's implemented under the hood. But you're right, it's confusing, and the behavior is not immediately obvious. – Hazem Khairy May 12 '23 at 17:21