
(An update to this question has been added.)

I am a graduate student at the University of Ghent, Belgium; my research is about emotion recognition with deep convolutional neural networks. I'm using the Caffe framework to implement the CNNs.

Recently I've run into a problem concerning class imbalance. I'm using 9216 training samples, of which approx. 5% are labeled positive (1); the remaining samples are labeled negative (0).

I'm using the SigmoidCrossEntropyLoss layer to calculate the loss. When training, the loss decreases and the accuracy is extremely high even after a few epochs. This is due to the imbalance: the network simply always predicts negative (0). (Precision and recall are both zero, backing up this claim.)

To solve this problem, I would like to scale the contribution to the loss depending on the prediction-truth combination (punish false negatives severely). My mentor/coach has also advised me to use a scale factor when backpropagating through stochastic gradient descent (SGD): the factor would be correlated to the imbalance in the batch. A batch containing only negative samples would not update the weights at all.

I have only added one custom-made layer to Caffe: to report other metrics such as precision and recall. My experience with Caffe code is limited but I have a lot of expertise writing C++ code.


Could anyone help me or point me in the right direction on how to adjust the SigmoidCrossEntropyLoss and Sigmoid layers to accommodate the following changes:

  1. adjust the contribution of a sample to the total loss depending on the prediction-truth combination (true positive, false positive, true negative, false negative); a rough sketch of what I mean follows this list.
  2. scale the weight update performed by stochastic gradient descent depending on the imbalance in the batch (negatives vs. positives).
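
To make change 1 concrete, here is a rough NumPy sketch of the kind of per-sample weighting I have in mind (the function and the weight values are mine, just placeholders):

import numpy as np

def weighted_sigmoid_cross_entropy(logits, targets, w_pos=10.0, w_neg=1.0):
    # Sigmoid cross-entropy with per-class weights. For a positive
    # target, -log(p) is large when the prediction is a false negative;
    # for a negative target, -log(1 - p) is large when it is a false
    # positive. Raising w_pos therefore punishes false negatives harder.
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-12                        # numerical safety
    loss = -(w_pos * targets * np.log(p + eps)
             + w_neg * (1 - targets) * np.log(1 - p + eps))
    return loss.mean()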

Thanks in advance!


Update

I have incorporated the InfogainLossLayer as suggested by Shai. I've also added another custom layer that builds the infogain matrix H based on the imbalance in the current batch.

Currently, the matrix is configured as follows:

H(i, j) = 0          if i != j
H(i, j) = 1 - f(i)   if i == j (with f(i) = the frequency of class i in the batch)

I'm planning on experimenting with different configurations for the matrix in the future.
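
For illustration, the computation inside that custom layer boils down to the following (a NumPy sketch of the C++ layer; the function name is hypothetical):

import numpy as np

def build_infogain_matrix(batch_labels, num_classes=2):
    # Diagonal entries are 1 - f(i), off-diagonal entries are 0,
    # matching the formula above (f(i) = frequency of class i in the batch).
    labels = np.asarray(batch_labels)
    freqs = np.bincount(labels, minlength=num_classes) / float(labels.size)
    return np.diag(1.0 - freqs).astype(np.float32)

# Example: a batch with a 10:1 imbalance
H = build_infogain_matrix([0] * 10 + [1])
# H[0, 0] = 1 - 10/11 ~ 0.09, H[1, 1] = 1 - 1/11 ~ 0.91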

I have tested this on a 10:1 imbalance. The results show that the network is now learning useful things (results after 30 epochs):

  • Accuracy is ~70% (down from ~97%);
  • Precision is ~20% (up from 0%);
  • Recall is ~60% (up from 0%).

These numbers were reached at around 20 epochs and didn't change significantly after that.

!! The results stated above are merely a proof of concept; they were obtained by training a simple network on a 10:1 imbalanced dataset. !!

Maarten Bamelis
  • well done! can you elaborate more about the custom layer you added to compute `H` per batch? – Shai May 31 '15 at 05:21
  • Sure, it's quite simple. The layer takes one blob as input: the ground truth labels for that batch; and produces one blob as output: the infogain matrix `H`. The frequencies for each class are calculated based on the labels, then the matrix is filled based on the formula mentioned in the update (I don't claim that to be _the one and only_ formula that works; I'm planning on experimenting with different values). – Maarten Bamelis May 31 '15 at 09:43
  • Maarten, did you put these layers on github? – RockridgeKid Sep 30 '15 at 22:04
  • @RockridgeKid I haven't made my adjustments publicly available since they are more of a hack-around and not a real improvement to the Caffe codebase. – Maarten Bamelis Oct 25 '15 at 15:36
  • @MaartenBamelis Two questions: 1. did you try giving samples different weight instead of changing the loss type, as shown [here](http://deepdish.io/2014/11/04/caffe-with-weighted-samples/)? 2. It seems obvious but to confirm, you didn't have to implement backward computation for the `H` matrix computation layer, right? – Autonomous Aug 07 '16 at 00:08
  • @ParagS.Chandakkar Thank you for your questions! To answer: 1. I did not weight the samples; 2. There is no backward computation for the H matrix layer. – Maarten Bamelis Aug 07 '16 at 00:23
  • @MaartenBamelis Last question: If you shuffled the data randomly (which I assume you must have), then did you really need to compute `H` at every batch? I mean, couldn't you assume that you have roughly the same proportion of positive and negative samples in each batch? Am I missing something here? – Autonomous Aug 07 '16 at 04:19
  • @ParagS.Chandakkar I did shuffle the data randomly, but the probability of getting one or more batches containing only negative samples was fairly high. I could compute H beforehand and use fixed values, but that did not really cross my mind. I really wanted the network to learn from the low number of positive samples I had. – Maarten Bamelis Aug 07 '16 at 09:11

2 Answers


Why don't you use the InfogainLoss layer to compensate for the imbalance in your training set?

The Infogain loss is defined using a weight matrix H (in your case, 2-by-2). The meanings of its entries are:

[cost of predicting 1 when gt is 0,    cost of predicting 0 when gt is 0
 cost of predicting 1 when gt is 1,    cost of predicting 0 when gt is 1]

So, you can set the entries of H to reflect the difference between errors in predicting 0 or 1.

You can find how to define matrix H for Caffe in this thread.
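
For example, here is a minimal sketch (my own, with made-up cost values) of saving a hand-chosen H as a binaryproto file that the layer can load:

import numpy as np
import caffe

# Illustrative costs only; fill the entries according to the layout above.
H = np.array([[1.0, 2.0],
              [5.0, 1.0]], dtype=np.float32)

# The layer reads H as a 1x1xLxL blob (L = number of labels).
blob = caffe.io.array_to_blobproto(H.reshape(1, 1, 2, 2))
with open('infogain_H.binaryproto', 'wb') as f:
    f.write(blob.SerializeToString())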

Regarding sample weights, you may find this post interesting: it shows how to modify the SoftmaxWithLoss layer to take sample weights into account.


Recently, a modification of the cross-entropy loss was proposed by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár: Focal Loss for Dense Object Detection (ICCV 2017).
The idea behind focal loss is to assign a different weight to each example based on the relative difficulty of predicting that example (rather than based on class size, etc.). From the brief time I got to experiment with this loss, it feels superior to "InfogainLoss" with class-size weights.
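
For reference, a minimal NumPy sketch of the binary focal-loss idea (not the authors' implementation; alpha and gamma are the paper's defaults):

import numpy as np

def focal_loss(p, targets, alpha=0.25, gamma=2.0):
    # p: predicted probability of the positive class; targets: 0/1 labels.
    # Easy examples (p_t close to 1) are down-weighted by (1 - p_t)**gamma;
    # alpha balances the contribution of the two classes.
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    eps = 1e-12
    return -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)).mean()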

Shai
  • Great answer, thank you very much. I didn't know about this layer; it seems like it provides the features I need. I'll check out all the information you linked and try to incorporate it into my architecture. I'll accept your answer once it has proven to help solve my class imbalance. – Maarten Bamelis May 28 '15 at 07:25
  • @MaartenBamelis I myself am facing some difficulty with learning models with class imbalance. I would appreciate it if you could post an update on your progress and how you overcome this difficulty. Thanks! – Shai May 28 '15 at 07:27
  • Are you sure the configuration of the matrix `H` is as you've written it? My interpretation of the [formula on this page](http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1InfogainLossLayer.html#details) makes it more logical to have `cost of predicting 0, ground truth 0` in the top-left corner and `cost of predicting 1, ground truth 1` in the bottom-right corner (having the classes in ascending order on both the rows and columns). – Maarten Bamelis May 29 '15 at 17:56
  • I have updated the question with the results of applying your solution. The `InfogainLossLayer` is a good tool to help solve class imbalance! The results as stated in the update are merely a proof of concept; instead of just predicting `0`, the network **is learning** to solve the task! – Maarten Bamelis May 30 '15 at 14:26
  • @Shai Could you confirm whether the meaning of the entries is as you described? The implementation is more aligned with Maarten Bamelis's logic, unless I'm missing something. – ypx Aug 10 '16 at 12:42
  • This is literally the only place I could find that actually shows what H looks like. Thanks. – Alex Sep 06 '17 at 17:38
  • +1 to @ypx's comment: where did you find that the (0, 0) entry in the matrix is the cost of predicting 1 when gt is 0? This is not obvious at all. – Alex Sep 11 '17 at 11:18
  • @Alex look at the way the loss is computed. – Shai Sep 11 '17 at 11:38
  • @Alex it's the math. You don't need to go to the source for that. The code only implements the algorithmic ideas in the math. – Shai Sep 13 '17 at 10:56
  • What I find counterintuitive is that Cost(predict 1|truth=0) is H[0,0] when it should be H[0,1] – Alex Sep 16 '17 at 20:24
  • @MaartenBamelis: how did you resolve the issue with the inputs of H? For example, if you have 3 classes, which loss does H[1,1] increase? – Alex Sep 16 '17 at 20:34
  • @Alex I carefully studied [the documentation about the InfoGainLossLayer](http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1InfogainLossLayer.html#details); it states that if the infogain matrix `H` is the identity matrix, the loss layer behaves as the MultinomialLogisticLossLayer. So as far as I understood, this means that `H[1, 1]` applies to the situation when class `1` is predicted when the ground truth is also class `1`. – Maarten Bamelis Sep 18 '17 at 07:48
  • @MaartenBamelis: Thanks Maarten, I had to actually look into the C++ source code to understand how it works. My real confusion was with the flattening of the softmax prob array and the multiplication loop by H[i, j]. Once I understood this, the rest was easy. Thanks again. – Alex Sep 18 '17 at 09:31

I have also come across this class imbalance problem in my classification task. Right now I am using CrossEntropyLoss with a weight (documentation here) and it works fine. The idea is to give more loss to samples in classes with a smaller number of images.

Calculating the weight

The weight for each class is inversely proportional to the number of images in that class. Here is a snippet to calculate the weights for all classes using NumPy:

import numpy as np

cls_num = []
# train_labels is a list of class labels for all training samples;
# the labels are in range [0, n-1] (n classes in total)
train_labels = np.asarray(train_labels)
num_cls = np.unique(train_labels).size

# count the number of samples in each class
for i in range(num_cls):
    cls_num.append(len(np.where(train_labels == i)[0]))

cls_num = np.array(cls_num)

# inverse-frequency weights, normalized so they sum to 1
cls_num = cls_num.max() / cls_num
x = 1.0 / np.sum(cls_num)

# weight is an array containing the weight to use in CrossEntropyLoss
# for each class
weight = x * cls_num
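
The resulting weight array can then be passed to the loss; a sketch, assuming PyTorch's nn.CrossEntropyLoss:

import torch
import torch.nn as nn

# weight is the NumPy array computed above
criterion = nn.CrossEntropyLoss(weight=torch.from_numpy(weight).float())
# loss = criterion(logits, labels)  # logits: (N, num_cls), labels: (N,)
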
jdhao