
I am implementing a triplet network in PyTorch where the three instances (sub-networks) share the same weights. Since the weights are shared, I implemented it as a single network that is called three times to produce the anchor, positive, and negative embeddings. The embeddings are learned by optimizing the triplet loss. Here is a small snippet for illustration:

from dependencies import *  # provides SingleSubNet, optimizer, triplet_loss, train_loader, epochs, ...
model = SingleSubNet()  # represents each instance in the triplet net

for epoch in epochs:
    for anch, pos, neg in train_loader:  # each batch is an (anchor, positive, negative) triple
        optimizer.zero_grad()
        # three forward passes through the same network (shared weights)
        fa, fp, fn = model(anch), model(pos), model(neg)
        loss = triplet_loss(fa, fp, fn)
        loss.backward()
        optimizer.step()
        # Do more stuff ...

My complete code works as expected. However, I do not understand what gradient(s) loss.backward() computes in this case. I am confused because there are three gradients of the loss in each learning step, one with respect to each embedding (the gradient formulas are here). I assume these gradients are summed before optimizer.step() is performed. But then, judging from the equations, the summed gradients would cancel each other out and yield a zero update term. Of course this is not true, since the network does learn meaningful embeddings in the end.

Thanks in advance

1 Answer


Late answer, but hope this helps someone. The gradients that you linked are the gradients of the loss with respect to the embeddings (the anchor, positive, and negative embeddings). To update the model parameters, you use the gradient of the loss with respect to the model parameters, and that does not sum to zero.
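Here is a minimal sketch that makes this concrete. It uses a stand-in nn.Sequential sub-network and nn.TripletMarginLoss with a deliberately large margin so the hinge term is active (not your actual SingleSubNet): the gradients with respect to the three embeddings do cancel each other out, but the gradient with respect to the shared parameters does not.

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))  # stand-in sub-network
triplet_loss = nn.TripletMarginLoss(margin=10.0)  # large margin so the hinge is active

anch, pos, neg = torch.randn(32, 8), torch.randn(32, 8), torch.randn(32, 8)
fa, fp, fn = model(anch), model(pos), model(neg)
loss = triplet_loss(fa, fp, fn)

# Gradients w.r.t. the three embeddings: these are the linked formulas, and they cancel.
g_fa, g_fp, g_fn = torch.autograd.grad(loss, (fa, fp, fn), retain_graph=True)
print((g_fa + g_fp + g_fn).abs().max())   # ~0

# Gradients w.r.t. the shared parameters: this is what optimizer.step() uses.
loss.backward()
print(model[0].weight.grad.abs().max())   # clearly non-zero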

The reason is that when calculating the gradient of the loss with respect to the model parameters, the formula makes use of the activations from the forward pass, and the three different inputs (anchor image, positive example, and negative example) produce different activations in their forward passes.
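To see the role of the activations explicitly, here is another small sketch with a single shared nn.Linear layer as a hypothetical stand-in for the sub-network. The accumulated weight gradient is the sum of three contributions, each built from an embedding gradient and its own input, and it does not cancel because the anchor, positive, and negative inputs differ. It should print True.

import torch
import torch.nn as nn

torch.manual_seed(0)

lin = nn.Linear(8, 4, bias=False)  # one shared layer, used for all three inputs
triplet_loss = nn.TripletMarginLoss(margin=10.0)

anch, pos, neg = torch.randn(32, 8), torch.randn(32, 8), torch.randn(32, 8)
fa, fp, fn = lin(anch), lin(pos), lin(neg)
loss = triplet_loss(fa, fp, fn)

# Per-embedding gradients (the formulas from the question) ...
g_fa, g_fp, g_fn = torch.autograd.grad(loss, (fa, fp, fn), retain_graph=True)
# ... combined with the three different inputs via the chain rule:
manual_grad = g_fa.t() @ anch + g_fp.t() @ pos + g_fn.t() @ neg

loss.backward()
print(torch.allclose(lin.weight.grad, manual_grad, atol=1e-6))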