
[Figure: network diagram with several auxiliary loss branches (the answers below refer to its Conv 11x11, Conv 5x5 and Conv 3x3 layers)]

As in this architecture, I want to use some auxiliary losses to improve my model's performance.
Which of the following code patterns implements this correctly in PyTorch?

# option one
loss1.backward()
loss2.backward()
loss3.backward()
optimizer.step()

# option two
loss1.backward()
optimizer.step()
loss2.backward()
optimizer.step()
loss3.backward()
optimizer.step()

# option three
loss = loss1 + loss2 + loss3
loss.backward()
optimizer.step()

Thanks for your answer!

KonstantinosKokos
heiheihei

5 Answers


The first and 3rd attempts are exactly the same and correct, while the 2nd approach is completely wrong.

In PyTorch, lower-layer gradients are not "overwritten" by subsequent backward() calls; rather, they are accumulated, or summed. This makes the first and 3rd approaches identical, though the 1st approach might be preferable if you have a low-memory GPU/RAM (a batch size of 1024 with one backward() + step() call is the same as having 8 batches of size 128 and 8 backward() calls, with one step() call at the end).
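This accumulation property can be checked directly. Here is a small self-contained sketch (the linear model and data are invented for illustration; reduction='sum' makes the equality exact, whereas with 'mean' you would have to rescale each chunk):

```python
import torch

# Toy setup: a single linear layer and a batch of 8.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss(reduction='sum')  # 'sum' makes the equality exact
data = torch.randn(8, 4)
target = torch.randn(8, 1)

# One backward() over the full batch ...
model.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# ... gives the same gradient as accumulating backward() over two half-batches.
model.zero_grad()
criterion(model(data[:4]), target[:4]).backward()
criterion(model(data[4:]), target[4:]).backward()
accum_grad = model.weight.grad

print(torch.allclose(full_grad, accum_grad))  # True
```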

To illustrate the idea, here is a simple example. We want to get our tensor x close to 40, 50 and 60 simultaneously:

import torch

criterion = torch.nn.MSELoss()  # for concreteness; any criterion producing a scalar works
x = torch.tensor([1.0], requires_grad=True)
loss1 = criterion(x, torch.tensor([40.0]))
loss2 = criterion(x, torch.tensor([50.0]))
loss3 = criterion(x, torch.tensor([60.0]))

Now the first approach (we use x.grad to inspect the current gradient of our tensor x):

loss1.backward()
loss2.backward()
loss3.backward()

print(x.grad)

This outputs: tensor([-294.]) (EDIT: put retain_graph=True in the first two backward() calls for more complicated computational graphs.)
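A minimal sketch of why retain_graph=True becomes necessary once the losses share an intermediate computation (the shared node h here is invented for illustration):

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
h = x * 2                  # a shared intermediate node
loss1 = (h - 3) ** 2
loss2 = (h - 5) ** 2

loss1.backward(retain_graph=True)  # keep the graph alive for the second call
loss2.backward()                   # would raise a RuntimeError without retain_graph above
print(x.grad)                      # tensor([-16.])
```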

The third approach:

loss = loss1+loss2+loss3
loss.backward()
print(x.grad)

Again the output is : tensor([-294.])

The 2nd approach is different because we don't call opt.zero_grad() after calling the step() method. This means the gradients of the first backward() call are reused in all 3 step() calls. For example, if the 3 losses provide gradients 5, 1 and 4 for the same weight, instead of a total of 10 (= 5 + 1 + 4), your weight will now receive 5*3 + 1*2 + 4*1 = 21 as its accumulated gradient across the steps.
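The 5*3 + 1*2 + 4*1 = 21 arithmetic can be reproduced with a toy parameter (a made-up setup: g * w has constant gradient g, and lr=1.0 makes each SGD step subtract exactly the accumulated gradient):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=1.0)

for g in (5.0, 1.0, 4.0):
    (g * w).backward()
    opt.step()           # no opt.zero_grad() in between!

# The three step() calls see gradients 5, 5+1=6 and 5+1+4=10,
# so the total update is 5 + 6 + 10 = 21 instead of 10.
print(w.item())  # 1.0 - 21.0 = -20.0
```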

For further reading: Link 1, Link 2

Shihab Shahriar Khan
  • I used the third method, it worked. Thank you for your patient and careful reply. – heiheihei Jan 02 '19 at 08:18
  • When I tried method 1, I needed to put `retain_graph=True` in the first loss, otherwise there's an error. Do you know why? – Raven Cheuk Apr 08 '19 at 00:35
  • Yes, because the intermediate computation graph is thrown away after the `backward()` call, so later calls don't have any graph to backpropagate through. My simple example doesn't have any intermediate computation, so it wasn't a problem here. Made an edit now. – Shihab Shahriar Khan Apr 08 '19 at 08:40
  • @ShihabShahriarKhan does the implementation guarantee that the backpropagation applies only to the respective components? For example, `loss1` should only impact the Conv 11x11 layer. Or is this explicitly known because of the `grad_fn` associated with each component? – Pleastry Dec 16 '21 at 15:12
  • Also, what about loss2 in this case? Does it backpropagate to Conv 11x11, or does it only impact Conv 5x5? – Pleastry Dec 16 '21 at 15:29

-- Comment on first approach removed, see other answer --

Your second approach would require that you backpropagate with retain_graph=True, which incurs heavy computational costs. Moreover, it is wrong: you would have updated the network weights with the first optimizer step, so your next backward() call would compute gradients with respect to the pre-update weights, which means the second step() call would insert noise into your updates. If, on the other hand, you performed another forward() call to backpropagate through the updated weights, you would end up with an asynchronous optimization, since the first layers would be updated once with the first step(), and then once more for each subsequent step() call (not wrong per se, but inefficient and probably not what you wanted in the first place).

Long story short, the way to go is the last approach. Reduce each loss to a scalar, sum the losses, and backpropagate the resulting loss. Side note: make sure your reduction scheme makes sense (e.g. if you are using reduction='sum' and the losses correspond to a multi-label classification, remember that the number of classes per objective is different, so the relative weight contributed by each loss would also be different).
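For instance, one common way to control the relative contributions is to weight the losses before summing. A sketch, reusing the toy 40/50/60 losses from the other answer (the weights w1..w3 are illustrative, not a recommendation):

```python
import torch

# Illustrative weights; in practice tune them to the relative scale
# and importance of each objective.
w1, w2, w3 = 1.0, 0.5, 0.25

x = torch.tensor([1.0], requires_grad=True)
loss1 = ((x - 40.0) ** 2).sum()   # each loss reduced to a scalar first
loss2 = ((x - 50.0) ** 2).sum()
loss3 = ((x - 60.0) ** 2).sum()

(w1 * loss1 + w2 * loss2 + w3 * loss3).backward()
print(x.grad)  # tensor([-156.5000])
```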

KonstantinosKokos
  • I believe this might have a few mistakes: [link1](https://discuss.pytorch.org/t/pytorch-gradients/884/2), [Link2](https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20). Please let me know if I've made any error myself. – Shihab Shahriar Khan Jan 01 '19 at 16:29
  • Good point, thanks for noticing and sorry for the wrong information; I was certain that this was the case for some reason. – KonstantinosKokos Jan 01 '19 at 18:09
  • Thank you for your patient and careful answer. Your answer gives me some inspiration, but I'm dealing with an object detection problem, so I don't know what impact it will have on classification. Thanks again! – heiheihei Jan 02 '19 at 08:29

The third attempt is the best.

Two different loss functions

If you have two different loss functions, finish the forward passes for both of them separately, and then finally you can do (loss1 + loss2).backward(). It's a bit more efficient and skips quite some computation.
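A sketch of that pattern, assuming a hypothetical model with one shared trunk and two heads (all names and shapes here are invented): the trunk is forwarded once, and summing the losses means only one backward pass through it.

```python
import torch

# Hypothetical shared trunk with two heads.
trunk = torch.nn.Linear(4, 8)
head1 = torch.nn.Linear(8, 1)
head2 = torch.nn.Linear(8, 1)

x = torch.randn(2, 4)
features = trunk(x)                      # forward through the trunk only once
loss1 = head1(features).pow(2).mean()    # two different loss functions
loss2 = head2(features).abs().mean()

(loss1 + loss2).backward()               # one backward pass through the trunk
print(trunk.weight.grad.shape)           # torch.Size([8, 4])
```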

Extra tip: Sum the loss

In your code you want to do:

loss_sum += loss.item()

to make sure you do not keep track of the history of all your losses.

item() will break the graph and thus allow it to be freed from one iteration of the loop to the next. You could also use detach() for the same purpose.
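A minimal training-loop sketch showing where .item() fits (the model, data and iteration count are illustrative):

```python
import torch

model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss_sum = 0.0
for _ in range(3):
    opt.zero_grad()
    loss = model(torch.randn(4, 2)).pow(2).mean()
    loss.backward()
    opt.step()
    loss_sum += loss.item()   # .item() returns a Python float, so no graph is retained

print(loss_sum / 3)           # average loss over the iterations
```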

prosti
  • Why would you want to sum the loss over multiple iterations? I don't see the added value of the extra tip – Tim Kuipers Jul 14 '23 at 12:45
  • 1
    @TimKuipers you might calculate the batch loss, add it together for each iteration and finally divide it by the number of batches to get an average epoch loss – mxcx Jul 31 '23 at 10:51

The first and 3rd attempts are correct, but they are not the same.

With the first attempt, the gradient flow through Conv 11x11 is computed multiple times (once per backward() call); with the 3rd attempt it is computed only once.

The same applies to the Conv 5x5, Conv 3x3, ... gradient computations.

xiaobao
  • if they are not same, how can they be correct? Can you please provide link or example supporting your argument? Thanks – Shihab Shahriar Khan Nov 28 '19 at 13:49
  • @ShihabShahriarKhan they create three different AD computational graphs, and thus is more complicated. I can provide you the [feedback](https://discuss.pytorch.org/t/how-to-weight-my-own-multiple-losses/25816/2?u=dejanbatanjac) from the major PyTorch developer. Check the word efficient. – prosti Sep 19 '20 at 13:28

The answer I had been looking for when I ended up here is the following:

y = torch.stack([loss1, loss2, loss3])
y.backward(gradient=torch.tensor([1.0, 1.0, 1.0]))

(Note: torch.stack keeps the losses attached to the autograd graph; torch.tensor([loss1, loss2, loss3]) would detach them, so backward() would fail.)

See https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients for confirmation.
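A self-contained sketch of this vector-valued backward, reusing the toy 40/50/60 losses from the accepted answer above:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
losses = torch.stack([((x - 40.0) ** 2).sum(),
                      ((x - 50.0) ** 2).sum(),
                      ((x - 60.0) ** 2).sum()])
losses.backward(gradient=torch.ones(3))   # weight each loss by 1.0
print(x.grad)  # tensor([-294.]), same as backpropagating the summed loss
```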

A similar question exists, but it uses different phrasing and this one was the question I found first when hitting the issue. The similar question can be found at: Pytorch. Can autograd be used when the final tensor has more than a single value in it?