Scale the loss value according to "badness" in Caffe

Question

I want to scale the loss value of each image based on how close or far the current prediction is from the correct label during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be smaller than if the network thinks it is "car".

This is what I am doing:

1- I defined a matrix of the distances between the labels,
2- I pass that matrix as a bottom to the "SoftmaxWithLoss" layer,
3- I multiply each log(prob) by the corresponding value in forward_cpu to scale the loss according to badness.

However, I do not know what to do in the backward_cpu part. I understand that the gradient (bottom_diff) has to change, but I am not sure how to incorporate the scale value. According to the math, I have to scale the gradient by the same factor (because it is just a scale), but I don't know how.

Also, it seems there is a loss layer in Caffe called "InfogainLoss" that does a very similar job, if I am not mistaken. However, the backward part of this layer is a little confusing:

bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;

I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like the softmax loss in both forward and backward?

I would highly appreciate it if someone could give me some pointers.

Shai · Answer 1 · 2017-07-17T09:41:19.200

You are correct in observing that the scaling you apply to log(prob) is exactly what the "InfogainLoss" layer does (you can read more about it here and here).

As for the derivative (back-prop): the loss computed by this layer is

L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )

If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), recall that the derivative of log(x) is 1/x. This is why you see that

dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
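
To make the two formulas above concrete, here is a minimal single-sample sketch in plain C++. It is not Caffe's implementation: the function names `infogain_forward` and `infogain_backward` are hypothetical, `std::vector` stands in for Caffe blobs, and the batch normalization factor (the `scale` in the question's snippet) is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward pass for one sample: L = -sum_j H[label][j] * log(prob[j]).
// H is a row-major dim x dim matrix (hypothetical stand-in for infogain_mat).
double infogain_forward(const std::vector<double>& prob,
                        const std::vector<double>& H,
                        std::size_t label) {
  const std::size_t dim = prob.size();
  assert(label < dim && H.size() == dim * dim);
  double loss = 0.0;
  for (std::size_t j = 0; j < dim; ++j) {
    const double h = H[label * dim + j];
    if (h != 0.0) loss -= h * std::log(prob[j]);  // skip zero weights
  }
  return loss;
}

// Backward pass with respect to the probabilities:
// dL/dprob[j] = -H[label][j] / prob[j].
std::vector<double> infogain_backward(const std::vector<double>& prob,
                                      const std::vector<double>& H,
                                      std::size_t label) {
  const std::size_t dim = prob.size();
  assert(label < dim && H.size() == dim * dim);
  std::vector<double> grad(dim);
  for (std::size_t j = 0; j < dim; ++j) {
    grad[j] = -H[label * dim + j] / prob[j];  // division by prob, as in the layer
  }
  return grad;
}
```

With the identity matrix as H, the forward value reduces to -log(prob[label]), which is the same loss value as the softmax loss. The backward pass still divides by prob because it differentiates with respect to the probabilities, not the raw scores.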

Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier's outputs, and a log loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
While working with the "InfogainLoss" layer, I noticed that prob(j) can sometimes have a very small value, leading to unstable estimation of the gradients.

Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
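
The derivation referred to here can be reconstructed from the standard softmax definition. What follows is a sketch rather than a copy of the original figure; the notation is assumed for illustration: x for the raw predictions, p for the softmax probabilities, l for the ground-truth label, and H for the infogain matrix.

```latex
% Softmax over raw predictions x
p_j = \frac{e^{x_j}}{\sum_k e^{x_k}}, \qquad
\frac{\partial \log p_j}{\partial x_i} = \delta_{ij} - p_i

% SoftmaxWithLoss (log loss on top of softmax)
L_{\text{softmax}} = -\log p_l, \qquad
\frac{\partial L_{\text{softmax}}}{\partial x_i} = p_i - \delta_{il}

% Infogain loss on top of softmax
L_{\text{infogain}} = -\sum_j H_{l,j} \log p_j, \qquad
\frac{\partial L_{\text{infogain}}}{\partial x_i}
  = -\sum_j H_{l,j}\,(\delta_{ij} - p_i)
  = p_i \sum_j H_{l,j} - H_{l,i}
```

Neither gradient with respect to x contains a division by p, which is why the combined form is more robust. With H equal to the identity matrix, the infogain gradient reduces to the "SoftmaxWithLoss" gradient.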

PS,
Note that if you are going to use infogain loss for weighting, you should feed H (the infogain_mat) with label similarities, rather than distances.
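
As an illustration of both points (a similarity matrix as H, and a gradient taken directly with respect to the raw predictions), here is a minimal C++ sketch of a fused softmax plus infogain gradient for one sample. It uses the standard result dL/dx(j) = p(j) * sum_k H(label, k) - H(label, j). The function names `softmax` and `softmax_infogain_grad` are hypothetical and this is not the code of any Caffe layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtract the maximum score before exponentiating.
std::vector<double> softmax(const std::vector<double>& x) {
  assert(!x.empty());
  const double x_max = *std::max_element(x.begin(), x.end());
  std::vector<double> p(x.size());
  double sum = 0.0;
  for (std::size_t j = 0; j < x.size(); ++j) {
    p[j] = std::exp(x[j] - x_max);
    sum += p[j];
  }
  for (double& v : p) v /= sum;
  return p;
}

// Gradient of L = -sum_j H[label][j] * log(softmax(x)[j]) with respect to x:
// dL/dx[j] = p[j] * sum_k H[label][k] - H[label][j].
// No division by p[j] appears, so tiny probabilities do not blow up.
std::vector<double> softmax_infogain_grad(const std::vector<double>& x,
                                          const std::vector<double>& H,
                                          std::size_t label) {
  const std::size_t dim = x.size();
  assert(label < dim && H.size() == dim * dim);
  const std::vector<double> p = softmax(x);
  double row_sum = 0.0;
  for (std::size_t k = 0; k < dim; ++k) row_sum += H[label * dim + k];
  std::vector<double> grad(dim);
  for (std::size_t j = 0; j < dim; ++j) {
    grad[j] = p[j] * row_sum - H[label * dim + j];
  }
  return grad;
}
```

For example, with classes ordered cat, dog, car, a made-up similarity matrix such as `{1.0, 0.5, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 1.0}` gives "dog" partial credit when the label is "cat", while "car" gets none. These values are illustrative only.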

Update:
I recently implemented this robust gradient computation and created this pull request. This PR was merged into the master branch in April 2017.

The derivative should be `dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)`. — nn0p, Jul 17 '17 at 09:22
