166

Where is the explicit connection between the optimizer and the loss?

How does the optimizer know where to get the gradients of the loss without a call like optimizer.step(loss)?

-More context-

When I minimize the loss, I didn't have to pass the gradients to the optimizer.

loss.backward() # Back Propagation
optimizer.step() # Gradient Descent
Shai
aerin

6 Answers

124

Without delving too deep into the internals of pytorch, I can offer a simplistic answer:

Recall that when initializing the optimizer you explicitly tell it which parameters (tensors) of the model it should update. The gradients are "stored" by the tensors themselves (each has a grad and a requires_grad attribute) once you call backward() on the loss. After the gradients have been computed for all tensors in the model, calling optimizer.step() makes the optimizer iterate over all the parameters (tensors) it is supposed to update and use their internally stored grad to update their values.
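The shared-tensor idea can be checked directly. The following is a minimal sketch, assuming a hypothetical one-layer model (the nn.Linear(2, 1) and the all-ones input are illustrative choices, not part of the question): the optimizer's parameter list holds the very same tensor objects as the model, and backward() is what populates their grad.

```python
import torch
from torch import nn

# Hypothetical tiny model, used only for illustration.
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The optimizer keeps references to the parameter tensors, not copies.
opt_params = optimizer.param_groups[0]["params"]
shares_tensors = all(p is q for p, q in zip(opt_params, model.parameters()))
print(shares_tensors)

# Before backward() no gradient has been stored on the parameters.
grads_before = [p.grad for p in model.parameters()]

# Forward pass on an all-ones batch of 4 samples, then backpropagate.
loss = model(torch.ones(4, 2)).sum()
loss.backward()  # writes into p.grad for each parameter

# After backward() every parameter carries its own gradient,
# which is exactly what optimizer.step() will read.
grads_after = [p.grad for p in model.parameters()]
print(grads_after)
```

Because the optimizer and the model point at the same objects, nothing needs to be passed from loss.backward() to optimizer.step().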

More info on computational graphs and the additional "grad" information stored in pytorch tensors can be found in this answer.

Referencing the parameters by the optimizer can sometimes cause troubles, e.g., when the model is moved to GPU after initializing the optimizer. Make sure you are done setting up your model before constructing the optimizer. See this answer for more details.

Shai
  • 25
    @Aerin it's not a trivial connection... One would have expected `optimizer.step` to get `loss.backward()` as an argument. However, it all happens "behind the curtain"... – Shai Dec 30 '18 at 06:52
  • 19
    So how does optimizer.step() get the gradient value from loss.backward()? It seems that this answer hasn't explained the mechanism for the "connection". The optimizer has a reference to the model parameters. But the loss function is completely on its own. It doesn't look like it has a reference to the model or the optimizer. – mofury Mar 29 '20 at 21:53
  • 5
    @cfeng the loss function is not on its own at all! It is the final node in a single gigantic computational graph which starts with the model inputs and contains all model parameters. This graph is built for each batch and results in a single scalar number per batch. When we do `loss.backward()` the process of backpropagation starts at the loss and goes through all of its parents all the way to the model inputs. All nodes in the graph contain a reference to their parents. – pseudomarvin Aug 29 '20 at 20:12
  • 4
    @mofury The question isn't that simple to answer in short. Roughly speaking, an instance of a loss function class, say `nn.CrossEntropyLoss`, can be called and returns a `Tensor`. The important part is that this `Tensor` object has a `grad_fn` attribute that records the operation that produced it and links back to the tensors it was derived from. Those tensors have the same attribute, so `backward` can backpropagate through this chain and eventually arrive at the model parameters we want to optimize. You can refer to this: https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html. – C.K. Aug 02 '21 at 13:30
  • 1
    is it correct to say that optimizer.step is only used in train phase and not in validation phase? I am aware that loss.backward is only used in train phase and not in validation phase. If it is correct, I am then confused why we then would bother with zeroing validation phase optimizer gradients using optimizer.zero_grad. – Mona Jalal Mar 09 '22 at 18:45
  • 2
    @MonaJalal where did you see an optimizer in a validation code? Optimizer has nothing to do with validation/test – Shai Mar 09 '22 at 19:13
  • 1
    https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html#inception-v3 uses optimizer.zero_grad also for validation phase which I think shouldn't – Mona Jalal Mar 09 '22 at 19:19
  • 1
    I modified it to https://pastebin.com/raw/0QH7FHnj for myself – Mona Jalal Mar 09 '22 at 19:20
  • 1
    @MonaJalal there is no optimizer in the code snippet you pasted, or am I missing something? Don't confuse `torch.no_grad()` with `optimizer.zero_grad()`: these are two completely different things – Shai Mar 09 '22 at 19:40
  • well, I believe we should use torch.no_grad for evaluation mode. – Mona Jalal Mar 09 '22 at 19:50
55

When you call loss.backward(), all it does is compute the gradient of the loss w.r.t. every parameter involved in the loss that has requires_grad=True, and store it in that parameter's .grad attribute.

optimizer.step() then updates all the parameters based on parameter.grad.
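To make the second step concrete, here is a small sketch comparing optimizer.step() with a hand-written update. It assumes plain SGD without momentum or weight decay; the tensor names w and w_manual and the toy loss (the sum of squares) are made up for the illustration.

```python
import torch

lr = 0.1

# Two independent copies of the same starting values.
w = torch.tensor([1.0, 2.0], requires_grad=True)
w_manual = w.detach().clone().requires_grad_(True)

# Path 1: let the optimizer read w.grad and update w.
optimizer = torch.optim.SGD([w], lr=lr)
(w ** 2).sum().backward()      # stores d(loss)/dw = 2*w in w.grad
optimizer.step()               # w <- w - lr * w.grad

# Path 2: do the same update by hand from the stored gradient.
(w_manual ** 2).sum().backward()
with torch.no_grad():          # the update itself must not be tracked
    w_manual -= lr * w_manual.grad

same = torch.allclose(w, w_manual)
print(w, w_manual, same)
```

Other optimizers (Adam, SGD with momentum, and so on) apply a different update rule, but they read the same .grad attribute.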

Morteza Jalambadani
  • 2,190
  • 6
  • 21
  • 35
Ganesh
  • 6
    `loss` is computing the loss between two tensors with no relation to a network. How does `loss.backward()` know which network it needs to reference and compute `parameter.grad` for? – Aziz Alfoudari May 21 '20 at 07:53
  • @AzizAlfoudari see my answer for a clarification attempt :). – pseudomarvin Aug 29 '20 at 20:14
50

Perhaps this will clarify a little the connection between loss.backward and optim.step (although the other answers are to the point).

import torch

# Our "model"
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x

# Compute loss
loss = y.sum()

# Compute gradient of the loss w.r.t. the parameters
print(x.grad)     # None
loss.backward()
print(x.grad)     # tensor([100., 100.])

# Modify the parameters by subtracting the gradient (scaled by lr)
optim = torch.optim.SGD([x], lr=0.001)
print(x)        # tensor([1., 2.], requires_grad=True)
optim.step()
print(x)        # tensor([0.9000, 1.9000], requires_grad=True)

loss.backward() sets the grad attribute of all tensors with requires_grad=True in the computational graph of which loss is the root (only x in this case).

The optimizer just iterates through the list of parameters (tensors) it received on initialization, and for every tensor whose .grad is populated it subtracts the gradient stored there (simply multiplied by the learning rate in the case of plain SGD). It doesn't need to know with respect to which loss the gradients were computed; it just reads the .grad property so it can do x = x - lr * x.grad

Note that if we were doing this in a train loop we would call optim.zero_grad() because in each train step we want to compute new gradients - we don't care about gradients from the previous batch. Not zeroing grads would lead to gradient accumulation across batches.
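The accumulation behaviour is easy to see with the same toy "model". This is a minimal sketch: backward() is called twice without zeroing in between, so the second gradient is added to the first. What zero_grad() leaves behind (None or a zero tensor) depends on the PyTorch version, so the sketch does not assume either.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

# First backward pass: the gradient is stored in x.grad.
(100 * x).sum().backward()
first = x.grad.clone()
print(first)          # tensor([100., 100.])

# Second backward pass without zero_grad(): the new gradient is ADDED.
(100 * x).sum().backward()
accumulated = x.grad.clone()
print(accumulated)    # tensor([200., 200.])

# Clearing the stored gradients before the next training step.
optim.zero_grad()
print(x.grad)         # None or a zero tensor, depending on the version
```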

pseudomarvin
32

Some answers explain this well, but I'd like to give a specific example to show the mechanism.

Suppose we have a function: z = 3x^2 + y^3.
The gradients of z w.r.t. x and y are:

dz/dx = 6x
dz/dy = 3y^2

The initial values are x=1 and y=2.

import torch
from torch import optim

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3*x**2+y**3

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
x.grad:  None
y.grad:  None
z.grad:  None

Then compute the gradients w.r.t. x and y at the current values (x=1, y=2):

dz/dx = 6x = 6*1 = 6
dz/dy = 3y^2 = 3*2^2 = 12

# calculate the gradient
z.backward()

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)

# print result should be:
x.grad:  tensor([6.])
y.grad:  tensor([12.])
z.grad:  None

Finally, use an SGD optimizer to update the values of x and y according to the formula:

x_new = x - lr * dz/dx
y_new = y - lr * dz/dy

# create an optimizer, pass x, y as the parameters to be updated, and set the learning rate lr=0.1
optimizer = optim.SGD([x, y], lr=0.1)

# executing an update step
optimizer.step()

# print the updated values of x and y
print("x:", x)
print("y:", y)

# print result should be:
x: tensor([0.4000], requires_grad=True)
y: tensor([0.8000], requires_grad=True)
LollipopKnight
29

Let's say we defined a model: model, and loss function: criterion and we have the following sequence of steps:

pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()

pred will have a grad_fn attribute that references the function that created it and ties it back to the model. Therefore, loss.backward() will have information about the model it is working with.

Try removing the grad_fn attribute, for example with:

pred = pred.clone().detach()

Then the loss is no longer connected to the model: the model gradients stay None and consequently the weights will not get updated. If nothing else in the loss requires a gradient, loss.backward() itself raises a RuntimeError.

And the optimizer is tied to the model because we pass model.parameters() when we create the optimizer.
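Both links can be inspected in a few lines. The following is a rough sketch under assumed stand-ins: nn.Linear(3, 1) for model, nn.MSELoss for criterion, and random tensors for the inputs and labels, none of which come from the answer itself. It checks that pred and loss carry a grad_fn, and that a loss built from a detached prediction cannot be backpropagated.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the model, criterion and data.
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
inputs = torch.randn(5, 3)
true_labels = torch.randn(5, 1)

pred = model(inputs)
loss = criterion(pred, true_labels)

# Both tensors remember the operation that produced them.
has_graph = pred.grad_fn is not None and loss.grad_fn is not None
print(pred.grad_fn, loss.grad_fn)

# Cutting the prediction out of the graph breaks the link to the model.
detached_loss = criterion(pred.clone().detach(), true_labels)
try:
    detached_loss.backward()
    raised = False
except RuntimeError as err:
    # Nothing in this loss requires a gradient, so autograd refuses.
    raised = True
    print("backward() failed:", err)

# backward() never reached the model, so no gradient was stored.
print(model.weight.grad)
```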

Akavall
-2

Short answer:

loss.backward() # computes the gradient of the loss for every parameter that has requires_grad=True. Parameters can be any tensors defined in the code, like h2h or i2h.

optimizer.step() # updates those parameters according to the optimizer's update rule (the optimizer is defined earlier in the code), moving them toward a lower loss (error).

sɐunıɔןɐqɐp
pourya