pytorch - connection between loss.backward() and optimizer.step()

Question

Where is an explicit connection between the optimizer and the loss?

How does the optimizer know where to get the gradients of the loss without a call like this: optimizer.step(loss)?

-More context-

When I minimize the loss, I didn't have to pass the gradients to the optimizer.

loss.backward() # Back Propagation
optimizer.step() # Gradient Descent

Shai · Accepted Answer · 2021-02-08T18:25:20.100

124

Without delving too deep into the internals of pytorch, I can offer a simplistic answer:

Recall that when initializing the optimizer you explicitly tell it which parameters (tensors) of the model it should be updating. The gradients are "stored" by the tensors themselves (they have grad and requires_grad attributes) once you call backward() on the loss. After computing the gradients for all tensors in the model, calling optimizer.step() makes the optimizer iterate over all parameters (tensors) it is supposed to update and use their internally stored grad to update their values.
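To make the paragraph above concrete, here is a minimal sketch. The tiny nn.Linear model, the random data, the MSE loss and the learning rate are all assumptions chosen for illustration; none of them come from the question. It checks that the optimizer holds the very same Parameter objects as the model, that backward() fills in their grad, and that step() applies the plain SGD rule to them.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                       # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The optimizer keeps references to the same Parameter objects the model owns.
held = optimizer.param_groups[0]["params"]
same_objects = all(p is q for p, q in zip(held, model.parameters()))

inputs = torch.randn(4, 3)                    # made-up batch
targets = torch.randn(4, 1)
loss = nn.functional.mse_loss(model(inputs), targets)

grads_before = [p.grad for p in held]         # nothing stored yet
loss.backward()                               # fills p.grad on each parameter
grads_after = [p.grad for p in held]

# Plain SGD (no momentum, no weight decay): p <- p - lr * p.grad
expected_weight = model.weight.detach().clone() - 0.1 * model.weight.grad
optimizer.step()
print(same_objects, torch.allclose(model.weight, expected_weight))
```

Nothing is passed from the loss to the optimizer; the shared Parameter objects are the only link.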

More info on computational graphs and the additional "grad" information stored in pytorch tensors can be found in this answer.

Having the optimizer hold references to the parameters can sometimes cause trouble, e.g., when the model is moved to GPU after initializing the optimizer. Make sure you are done setting up your model before constructing the optimizer. See this answer for more details.

edited Feb 08 '21 at 18:25

answered Dec 30 '18 at 06:39

25

@Aerin it's not a trivial connection... One would have expected `optimizer.step` to get `loss.backward()` as an argument. However, it all happens "behind the curtain"... – Shai Dec 30 '18 at 06:52
19

So how does optimizer.step() get the gradient value from loss.backward()? It seems that this answer hasn't explained the mechanism for the "connection". The optimizer has a reference to the model parameters. But the loss function is completely on its own. It doesn't look like it has a reference to the model or the optimizer. – mofury Mar 29 '20 at 21:53
5

@cfeng the loss function is not on its own at all! It is the final node in a single gigantic computational graph which starts with the model inputs and contains all model parameters. This graph is computed for each batch and results in a single scalar number on each batch. When we do `loss.backward()` the process of backpropagation starts at the loss and goes through all of its parents all the way to the model inputs. All nodes in the graph contain a reference to their parents. – pseudomarvin Aug 29 '20 at 20:12
4

@mofury The question isn't simple to answer briefly. Roughly speaking, an instance of a loss function class, say `nn.CrossEntropyLoss`, can be called and returns a `Tensor`. Importantly, this `Tensor` object has a `grad_fn` attribute that records the operation that produced it and links back to the tensors it was derived from. Those tensors have the same attribute, so the `backward` function can backpropagate through this chain and eventually arrive at the model parameters we want to optimize. You can refer to this: https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html. – C.K. Aug 02 '21 at 13:30
1

Is it correct to say that optimizer.step is only used in the training phase and not in the validation phase? I am aware that loss.backward is only used in the training phase and not in the validation phase. If that is correct, I am confused why we would then bother with zeroing the optimizer gradients in the validation phase using optimizer.zero_grad. – Mona Jalal Mar 09 '22 at 18:45
2

@MonaJalal where did you see an optimizer in a validation code? Optimizer has nothing to do with validation/test – Shai Mar 09 '22 at 19:13
1

https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html#inception-v3 also uses optimizer.zero_grad in the validation phase, which I think it shouldn't – Mona Jalal Mar 09 '22 at 19:19
1

I modified it to https://pastebin.com/raw/0QH7FHnj for myself – Mona Jalal Mar 09 '22 at 19:20
1

@MonaJalal there is no optimizer in the code snippet you pasted, or am I missing something? Don't confuse `torch.no_grad()` with `optimizer.zero_grad()`: these are two completely different things – Shai Mar 09 '22 at 19:40
well, I believe we should use torch.no_grad for evaluation mode. – Mona Jalal Mar 09 '22 at 19:50
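For readers following the validation discussion in the comments above, a small sketch of what torch.no_grad() does, as opposed to optimizer.zero_grad(). The nn.Linear model and random inputs are assumptions made up for this illustration. Under no_grad, no graph is recorded at all, so there is nothing to backpropagate and nothing for an optimizer to consume.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)          # hypothetical toy model
inputs = torch.randn(4, 3)       # made-up batch

# Validation-style forward pass: autograd records nothing.
model.eval()
with torch.no_grad():
    val_out = model(inputs)
tracked_in_eval = val_out.requires_grad      # no graph behind val_out

# Training-style forward pass: the output is tied to the parameters.
model.train()
train_out = model(inputs)
tracked_in_train = train_out.requires_grad   # graph recorded via grad_fn

print(tracked_in_eval, tracked_in_train)
```

optimizer.zero_grad(), by contrast, only clears the grad values already stored on the parameters; it does not turn gradient tracking on or off.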

score 55 · Answer 2 · edited Feb 27 '19 at 14:32

55

When you call loss.backward(), all it does is compute the gradient of the loss w.r.t. all the parameters the loss depends on that have requires_grad = True, and store it in the parameter.grad attribute of every such parameter.

optimizer.step() updates all the parameters based on parameter.grad
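A minimal sketch of the requires_grad part of this answer. The tensors w, b and x and the squared loss are hypothetical, picked so the gradient is easy to check by hand: only the tensor with requires_grad=True ends up with a stored grad.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)    # trainable (hypothetical)
b = torch.tensor([1.0], requires_grad=False)   # frozen (hypothetical)
x = torch.tensor([3.0])                        # plain input data

loss = ((w * x + b) ** 2).sum()                # (2*3 + 1)^2 = 49
loss.backward()

# d loss / d w = 2 * (w*x + b) * x = 2 * 7 * 3 = 42
print(w.grad)   # gradient stored on w
print(b.grad)   # nothing stored on b
```

An optimizer built over [w, b] would therefore only have something to apply to w.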

edited Feb 27 '19 at 14:32

Morteza Jalambadani

answered Feb 27 '19 at 13:26

Ganesh

6

`loss` is computing the loss between two tensors with no relation to a network. How does `loss.backward()` know which network it needs to reference and compute `parameter.grad` for? – Aziz Alfoudari May 21 '20 at 07:53
@AzizAlfoudari see my answer for a clarification attempt :). – pseudomarvin Aug 29 '20 at 20:14

pseudomarvin · Answer 3 · 2022-09-09T07:45:35.610

Perhaps this will clarify a little the connection between loss.backward and optim.step (although the other answers are to the point).

import torch

# Our "model"
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x

# Compute loss
loss = y.sum()

# Compute gradients of the loss w.r.t. the parameters
print(x.grad)     # None
loss.backward()
print(x.grad)     # tensor([100., 100.])

# Modify the parameters by subtracting lr * gradient
optim = torch.optim.SGD([x], lr=0.001)
print(x)        # tensor([1., 2.], requires_grad=True)
optim.step()
print(x)        # tensor([0.9000, 1.9000], requires_grad=True)

loss.backward() sets the grad attribute of all leaf tensors with requires_grad=True in the computational graph that ends at loss (only x in this case).

The optimizer just iterates through the list of parameters (tensors) it received on initialization and, for every tensor that has a gradient stored in its .grad property, subtracts that gradient (simply multiplied by the learning rate in the case of plain SGD). It doesn't need to know with respect to what loss the gradients were computed; it just accesses the .grad property so it can do x = x - lr * x.grad
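As a rough illustration of that last sentence, the sketch below writes the update by hand and compares it with optim.step(). It assumes plain SGD (no momentum, no weight decay); the names x_opt and x_manual are made up here, and this is not how the library implements the step internally.

```python
import torch

lr = 0.001
x_opt = torch.tensor([1., 2.], requires_grad=True)      # updated by the optimizer
x_manual = torch.tensor([1., 2.], requires_grad=True)   # updated by hand

# Same loss as in the example above, computed once per tensor
(100 * x_opt).sum().backward()
(100 * x_manual).sum().backward()

# Library version
torch.optim.SGD([x_opt], lr=lr).step()

# Hand-written version: only reads .grad, knows nothing about the loss
with torch.no_grad():
    for p in [x_manual]:
        if p.grad is not None:
            p -= lr * p.grad

print(x_opt, x_manual)
```

Both tensors end up with the same values, which is the whole "connection": the update only ever reads .grad.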

Note that if we were doing this in a train loop we would call optim.zero_grad() because in each train step we want to compute new gradients - we don't care about gradients from the previous batch. Not zeroing grads would lead to gradient accumulation across batches.
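The accumulation mentioned above can be seen directly with the same toy tensor. This is only a sketch: the loss is recomputed before each backward() call because the graph is freed after a backward pass, and the variable names first, accumulated and fresh are made up for the illustration.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

(100 * x).sum().backward()
first = x.grad.clone()          # tensor([100., 100.])

# Second backward without zeroing: gradients are added to the stored ones
(100 * x).sum().backward()
accumulated = x.grad.clone()    # tensor([200., 200.])

# Clearing first gives a fresh gradient on the next backward
optim.zero_grad()
(100 * x).sum().backward()
fresh = x.grad.clone()          # tensor([100., 100.])

print(first, accumulated, fresh)
```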

I like this kind of "hands on" explanation to understand things. Thanks, makes much more sense to me! — kushy, Sep 25 '20 at 16:23
It should be "Compute gradients of the loss w.r.t. the parameters." — passerby51, Sep 09 '22 at 00:25

score 32 · Answer 4 · answered Feb 14 '21 at 03:55

Some answers explain this well, but I'd like to give a specific example to illustrate the mechanism.

Suppose we have a function: z = 3 x^2 + y^3.
The gradient-descent update formula for x and y w.r.t. z is:

x := x - lr * dz/dx = x - lr * 6x
y := y - lr * dz/dy = y - lr * 3 y^2

The initial values are x=1 and y=2.

import torch
from torch import optim

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3*x**2+y**3

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
x.grad:  None
y.grad:  None
z.grad:  None

Then compute the gradient of z w.r.t. x and y at the current values (x=1, y=2):

# calculate the gradient
z.backward()

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
x.grad:  tensor([6.])
y.grad:  tensor([12.])
z.grad:  None

Finally, use the SGD optimizer to update the values of x and y according to the formula:

# create an optimizer, pass x, y as the parameters to be updated, and set the learning rate lr=0.1
optimizer = optim.SGD([x, y], lr=0.1)

# executing an update step
optimizer.step()

# print the updated values of x and y
print("x:", x)
print("y:", y)

# print result should be:
x: tensor([0.4000], requires_grad=True)
y: tensor([0.8000], requires_grad=True)
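As a cross-check of the numbers above, this sketch compares the autograd gradients with the closed-form derivatives dz/dx = 6x and dz/dy = 3y^2, and the optimizer result with the hand-computed update. The helper names dz_dx, dz_dy, expected_x and expected_y are introduced only for this illustration.

```python
import torch
from torch import optim

lr = 0.1
x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3 * x**2 + y**3
z.backward()

# Closed-form partial derivatives at (x=1, y=2)
dz_dx = 6 * x.detach()           # 6 * 1   = 6
dz_dy = 3 * y.detach() ** 2      # 3 * 2^2 = 12

# Hand-computed SGD update
expected_x = x.detach() - lr * dz_dx    # 1 - 0.1 * 6  = 0.4
expected_y = y.detach() - lr * dz_dy    # 2 - 0.1 * 12 = 0.8

optimizer = optim.SGD([x, y], lr=lr)
optimizer.step()

print(torch.allclose(x, expected_x), torch.allclose(y, expected_y))
```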

Very good explanations, the analogy with the real example is great honestly — Timbus Calin, Jun 25 '22 at 11:20

Akavall · Answer 5 · 2020-07-29T03:48:28.597

29

Let's say we defined a model: model, and loss function: criterion and we have the following sequence of steps:

pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()

pred will have a grad_fn attribute that references the function that created it and ties it back to the model. Therefore, loss.backward() will have information about the model it is working with.

Try removing grad_fn attribute, for example with:

pred = pred.clone().detach()

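Then the loss is no longer connected to the model: the model gradients will stay None (in a typical setup loss.backward() fails with a RuntimeError because nothing in the loss requires grad), and consequently the weights will not get updated.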

And the optimizer is tied to the model because we pass model.parameters() when we create the optimizer.
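Here is a small sketch of the experiment described in this answer. The nn.Linear model, nn.CrossEntropyLoss criterion, random inputs and labels are assumptions standing in for the model, criterion, input and true_labels named above. It shows that pred carries a grad_fn, that the detached copy does not, and that backward() then has no path back to the model.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(2, 2)                        # hypothetical stand-in model
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(5, 2)                     # made-up data
true_labels = torch.tensor([0, 1, 0, 1, 1])

pred = model(inputs)
has_grad_fn = pred.grad_fn is not None         # pred is tied to the model

detached = pred.clone().detach()
detached_has_grad_fn = detached.grad_fn is not None   # the link is cut

loss = criterion(detached, true_labels)
try:
    loss.backward()                            # nothing here requires grad
    backward_failed = False
except RuntimeError:
    backward_failed = True

weight_grad_is_none = model.weight.grad is None
print(has_grad_fn, detached_has_grad_fn, backward_failed, weight_grad_is_none)
```

With the original pred instead of the detached copy, the same loss.backward() call would populate model.weight.grad.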

edited Jul 29 '20 at 03:48

answered May 25 '20 at 23:49

1

shouldn't it be "Then the models **gradients** will not get updated.", since loss.backward() updates the gradients? – zwithouta Jul 28 '20 at 13:46
@zwithouta, Thanks, this is a good point. I updated my answer. – Akavall Jul 29 '20 at 03:49

score -2 · Answer 6 · edited Aug 02 '20 at 09:31

-2

Short answer:

loss.backward() # compute the gradient of the loss w.r.t. every parameter for which we set requires_grad=True. Parameters can be any tensors defined in the code, like h2h or i2h.

optimizer.step() # following the update rule of the optimizer (defined previously in our code), update those parameters to move toward the minimum loss (error).

edited Aug 02 '20 at 09:31

sɐunıɔןɐqɐp

answered Aug 02 '20 at 07:56

pourya
