
I am trying to use caffe to implement the triplet loss described in Schroff, Kalenichenko and Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering", 2015.

I am new to this, so how do I calculate the gradient in back propagation?

  • I see there is an open PR implementing this loss: https://github.com/BVLC/caffe/pull/3663 – Shai Feb 25 '16 at 06:38

1 Answer


I assume you define the loss layer as

layer {
  name: "tripletLoss"
  type: "TripletLoss"
  bottom: "anchor"
  bottom: "positive"
  bottom: "negative"
  ...
}
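
For context, here is a minimal sketch of how the three "bottom"s might be produced, assuming the usual siamese-style setup in which three copies of the embedding layer share their weights through common param names. The blob names, dimensions and the single InnerProduct embedding are made up for illustration, and the TripletLoss layer itself still has to be provided (e.g. by the PR linked in the comments, or by a Python layer).

import caffe
from caffe import layers as L

def embedding_branch(data):
    # weight sharing across the three branches comes from the shared param names
    return L.InnerProduct(data, num_output=128,
                          param=[dict(name='embed_w', lr_mult=1),
                                 dict(name='embed_b', lr_mult=2)])

n = caffe.NetSpec()
# in a real net these would come from a data layer producing (anchor, positive, negative) triplets
n.anchor_data   = L.Input(shape=[dict(dim=[32, 256])])
n.positive_data = L.Input(shape=[dict(dim=[32, 256])])
n.negative_data = L.Input(shape=[dict(dim=[32, 256])])

n.anchor   = embedding_branch(n.anchor_data)
n.positive = embedding_branch(n.positive_data)
n.negative = embedding_branch(n.negative_data)

n.tripletLoss = L.TripletLoss(n.anchor, n.positive, n.negative, loss_weight=1)
print(n.to_proto())   # emits a prototxt containing a layer like the one above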

Now you need to compute the gradient w.r.t. each of the "bottom"s.

The loss is given by:

`L = sum_i max( ||fa_i - fp_i||^2 - ||fa_i - fn_i||^2 + alpha, 0 )`

Only triplets that violate the margin, i.e. for which the term inside max(...) is positive, contribute to the gradients below; for the rest the gradient is zero.

The gradient w.r.t. the "anchor" input (fa):

`dL/dfa_i = 2(fa_i - fp_i) - 2(fa_i - fn_i) = 2(fn_i - fp_i)`

The gradient w.r.t. the "positive" input (fp):

`dL/dfp_i = -2(fa_i - fp_i) = 2(fp_i - fa_i)`

The gradient w.r.t. the "negative" input (fn):

`dL/dfn_i = 2(fa_i - fn_i)`
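
As a sanity check on these expressions, here is a minimal NumPy sketch (not a caffe layer, just the math above) that computes the loss over a batch and the three gradients, zeroing them for triplets that do not violate the margin; the function name, shapes and margin value are only illustrative.

import numpy as np

def triplet_loss_and_grads(fa, fp, fn, alpha=0.2):
    """fa, fp, fn: (batch, dim) embeddings of anchor / positive / negative."""
    hinge = np.sum((fa - fp) ** 2, axis=1) - np.sum((fa - fn) ** 2, axis=1) + alpha
    active = (hinge > 0).astype(fa.dtype)[:, None]  # 1 where the margin is violated
    loss = np.sum(np.maximum(hinge, 0.0))

    grad_a = active * (2.0 * (fa - fp) - 2.0 * (fa - fn))  # = 2(fn - fp)
    grad_p = active * (-2.0 * (fa - fp))                   # = 2(fp - fa)
    grad_n = active * ( 2.0 * (fa - fn))
    return loss, grad_a, grad_p, grad_n

# quick finite-difference check of the anchor gradient on random data
rng = np.random.RandomState(0)
fa, fp, fn = rng.randn(3, 8, 16)
loss, ga, gp, gn = triplet_loss_and_grads(fa, fp, fn)
eps = 1e-5
fa2 = fa.copy()
fa2[0, 0] += eps
numeric = (triplet_loss_and_grads(fa2, fp, fn)[0] - loss) / eps
print(abs(numeric - ga[0, 0]) < 1e-3)  # expect True

For a fixed set of violating triplets the loss is linear in fa (the quadratic terms cancel), so the finite-difference value should agree with the analytic gradient essentially exactly.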


The original calculation (I leave it here for sentimental reasons...):

[original derivation image, with an error in the last term]

Please see the comment correcting the last term.

  • The last one, gradient of "negative", shouldn't it be 2(fa - fn)? – Mickey Shine Oct 27 '15 at 11:13
  • @MickeyShine you should look at the implementation of [`EuclideanLossLayer`](https://github.com/BVLC/caffe/blob/master/src/caffe/layers/euclidean_loss_layer.cpp) to see how these computations can be implemented in caffe. – Shai Oct 27 '15 at 11:17
  • sure, I'm gonna take a look – Mickey Shine Oct 27 '15 at 11:35
  • Shai, can you please clarify how to count gradient for anchor image as input if you have batch size of 64 for example? What i mean is in this batch we will have several negatives and positives and which one should i take to count sum(2(fa-fp)-2(fa-fn)) (which btw equal to sum(2(fn-fp))? or this gradient only works for 1 triplet, with 1 of each? as i understand Caffe we need to have 64 gradients for batch size of 64 on loss level, no? – loknar Nov 18 '15 at 17:44
  • @loknar the gradient is computed as a sum over `i`. This sum is over all examples in the batch. As you can see some examples contribute to the gradient (if they violate the margin) and some don't. – Shai Nov 19 '15 at 06:58
  • @Shai, and why then you need 3 formulas? Do i understand correctly that if we have for example 10 batch size, 4 positives (0,1,2,3 indexes), 6 negatives, we generate 6 triplets (from 6 pos-pos pairs (0-1, 0-2, 0-3, 1-2, 1-3, 2-3) + 1 matching negative per pair), 4 triplets for example violate margin, then we calculate every formula for each triplet (ie 3 times per triplet) and then sum all 12 scalars (4 triplets * 3 formulas) for back propagating 1 scalar? – loknar Nov 19 '15 at 08:31
  • @loknar you need to backprop a different gradient to each `bottom` - this is why you have three different expressions. – Shai Nov 19 '15 at 08:34
  • Shouldn't the first gradient be `2(a-p) - 2(a-n)` or simplified `2(n-p)`? – hbaderts Dec 06 '16 at 23:57
  • @Shai This may be a dumb question but do all three CNNs share the same weights? (So backpropagation will average the three backpropagations and update them). If not (which seems like it), which CNN do we use at test time? – MoneyBall May 16 '17 at 07:01
  • @MoneyBall (1) not a dumb question. (2) it all boils down to how you are going to deploy your net. If you are after learning an embedding to feature space and thus you have a **single** embedding (i.e. a single CNN) then all copies must share weights during training. – Shai May 16 '17 at 07:09
  • @Shai I see. Since we have three different backpropagations, is averaging them the most common method? Or do you add them up? – MoneyBall May 16 '17 at 07:15
  • @MoneyBall once you define weight sharing ([using `name` in weights' `param`](http://caffe.berkeleyvision.org/gathered/examples/siamese.html)) caffe takes care for all the rest for you. – Shai May 16 '17 at 07:17
  • @Shai Ah i see. (1) Do you by any chance know what caffe does? (just out of curiosity). (2) As you may have noticed from answering all my posts, I've been studying CNN a lot lately, and you obviously seem to be an expert. I've been reading quite a few papers on similarity learning. I found many loss functions: contrastive, triplet, and lifted structure loss. They say they are doing metric learning but aren't they doing embedding learning? As in they are trying to find the suitable representation in feature space with euclidean distance metric? – MoneyBall May 16 '17 at 07:32
  • @MoneyBall (1) I'm not sure but AFAIK caffe averages the weights. (2) these methods seem to learn an embedding into a metric space, so you might call them metric learning. Have you looked at [*Tadmor et al*, **Learning a Metric Embedding for Face Recognition using the Multibatch Method** (2016)](https://arxiv.org/abs/1605.07270)? I find this approach very interesting and efficient. – Shai May 16 '17 at 07:50
  • @Shai I'll definitely take a look at it. Thank you. – MoneyBall May 16 '17 at 08:01
  • @Shai can you please explain how these three gradient computations aid in adjusting the embeddings of the network such that positive is close and negative is far – shaifali Gupta Dec 23 '18 at 17:42
  • @shaifaliGupta it's quite self explanatory: if the positive is indeed closer to the anchor than the negative, then the gradient is zero - no need to make any change. Otherwise the gradients push the positive point in the direction of the anchor, push the negative point away from the anchor, and move the anchor itself in the direction from the negative towards the positive. – Shai Dec 23 '18 at 17:47
  • @Shai I am sorry. I am unable to visualize it. What I can make out of the equations is that each says: f^n=f^p, f^a=f^p and f^n=f^a – shaifali Gupta Dec 23 '18 at 17:54
  • @Shai Based off your work I have derived the backpropagation for a Contrastive loss function (trying to replicate caffe's implementation but with custom sampling). It looks like the gradient w.r.t. inputs "A" and "B" (outputs of ip2 that I have 'sliced') are just negatives of each other. Given that siamese networks share weights, how would this help at all? Wouldn't they cancel out? Or do we simply ignore one of these gradients? Thanks. – user2066337 May 07 '19 at 19:30
  • @user2066337 - I'll need to see the loss and the derivatives to answer that question. Comments are **not** the scope for such inquiries. Please consider asking a new question. – Shai May 08 '19 at 06:17
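
Following up on the `EuclideanLossLayer` pointer and the batch questions in the comments above, here is a rough, untested sketch of how this loss could be written as a caffe Python layer (`type: "Python"`), making the per-bottom gradients and the sum over the batch explicit. The class name, the use of `param_str` for the margin, and the plain sum over the batch (no normalization) are my own choices, not taken from the PR linked under the question.

import caffe
import numpy as np

class TripletLossLayer(caffe.Layer):
    """Bottoms: anchor, positive, negative embeddings (batch x dim). Top: scalar loss."""

    def setup(self, bottom, top):
        if len(bottom) != 3:
            raise Exception("expects three bottoms: anchor, positive, negative")
        # param_str is assumed to hold just the margin, e.g. param_str: "0.2"
        self.margin = float(self.param_str) if self.param_str else 0.2

    def reshape(self, bottom, top):
        top[0].reshape(1)  # a single scalar loss

    def forward(self, bottom, top):
        fa, fp, fn = bottom[0].data, bottom[1].data, bottom[2].data
        hinge = (np.sum((fa - fp) ** 2, axis=1)
                 - np.sum((fa - fn) ** 2, axis=1) + self.margin)
        # remember which triplets violate the margin; only they get a gradient
        self.active = hinge > 0
        top[0].data[...] = np.sum(hinge[self.active])

    def backward(self, top, propagate_down, bottom):
        fa, fp, fn = bottom[0].data, bottom[1].data, bottom[2].data
        # top[0].diff[0] carries the loss_weight set in the prototxt
        scale = top[0].diff[0] * self.active[:, np.newaxis]
        if propagate_down[0]:
            bottom[0].diff[...] = scale * 2.0 * (fn - fp)  # dL/dfa = 2(fn - fp)
        if propagate_down[1]:
            bottom[1].diff[...] = scale * 2.0 * (fp - fa)  # dL/dfp = 2(fp - fa)
        if propagate_down[2]:
            bottom[2].diff[...] = scale * 2.0 * (fa - fn)  # dL/dfn = 2(fa - fn)

In the prototxt such a layer would be declared with `type: "Python"`, a `python_param` block pointing at the module and class, and `loss_weight: 1` so caffe treats the top as a loss; weight sharing between the three embedding branches is handled separately, by giving their layers the same `param { name: ... }` as in the siamese example linked above.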