Neural Network - Working with an imbalanced dataset

Question

I am working on a classification problem with two labels: 0 and 1. My training set is very imbalanced (and, given my problem, the test set will be too).

The class ratio is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should have around 100,000 samples with label '1'.

Given the large number of training samples, I ruled out SVMs. I also read about SMOTE for Random Forests. However, I was wondering whether a neural network could handle this kind of imbalance efficiently on such a large dataset?

Also, as I am using TensorFlow to design the model, which settings should/could I tune to handle this imbalance?

Thanks for your help ! Paul

Update :

Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.

1) Over the weekend I tried the first option, increasing the cost for the positive label. With a less unbalanced ratio (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall trade-off. However, in my situation it seems to be very sensitive to the value of alpha. With alpha = 250, which is the ratio of the unbalanced dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should: around 50% of its predictions are label '1'. With alpha = 100, the model predicts only '0'. I guess I'll have to do some tuning of this alpha parameter :/ I did the weighting manually for now, so I'll also take a look at this TF function: tf.nn.weighted_cross_entropy_with_logits

2) I will try to rebalance the dataset, but I am afraid I will lose a lot of information doing that, as I have millions of samples but only ~100k positive samples.

3) Using a smaller batch size seems indeed a good idea. I'll try it !

Are you expecting the network to encounter the '0' label input data more often when applied after fitting the network? If so, you might be interested in having the unbalance because this would incite the network to learn those input patterns the best — jorgenkg, Jul 30 '16 at 05:42
Thanks for your comment ! The proportion of label '0' and '1' is the same in the training set and in the set I will have to predict. However, I don't really understand the second part of your message : "If so, you might be interested in having the unbalance because this would incite the network to learn those input patterns the best " ? — Paul Rolin, Aug 01 '16 at 15:10

score 4 · Answer 1 · answered Jul 29 '16 at 21:34

There are usually two common ways to deal with an imbalanced dataset:

Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of the two classes. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
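The online sampling option could be sketched roughly as follows. This is plain Python rather than a TensorFlow input pipeline, and the function name `balanced_batches`, the 50/50 split, and the toy data are all assumptions made for illustration; positives are drawn with replacement because they are the scarce class.

```python
import random

def balanced_batches(negatives, positives, batch_size, num_batches, seed=0):
    """Yield shuffled batches that are half negatives and half positives.

    `negatives` and `positives` are lists of examples (hypothetical inputs).
    """
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        # Negatives are plentiful: sample without replacement within a batch.
        neg = rng.sample(negatives, half)
        # Positives are rare: sample with replacement so any batch size works.
        pos = [rng.choice(positives) for _ in range(batch_size - half)]
        batch = [(x, 0) for x in neg] + [(x, 1) for x in pos]
        rng.shuffle(batch)
        yield batch

# Toy data with the question's 1000:4 ratio.
negatives = [("neg", i) for i in range(1000)]
positives = [("pos", i) for i in range(4)]
batches = list(balanced_batches(negatives, positives, batch_size=8, num_batches=3))
```

Note that a model trained on batches like these sees a 50/50 class prior, so its output scores no longer reflect the original 1000:4 ratio.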

score 3 · Answer 2 · answered Jul 29 '16 at 21:53

I will expand a bit on chasep's answer. If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say the dominant class is labelled negative (neg) for softmax and the other positive (pos) (for hinge loss you could do exactly the same):

 L = L_{neg} + L_{pos}  =>  L = L_{neg} + \alpha * L_{pos}

With \alpha greater than 1.

In TensorFlow, for cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:

# tf.log in early TensorFlow releases; tf.math.log in current versions
cross_entropy_mean = -tf.reduce_mean(targets * tf.math.log(y_out) * tf.constant([alpha, 1.]))

What is more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
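To make the weighting concrete, here is a minimal plain-Python sketch of the L_{neg} + \alpha * L_{pos} idea. It is not the TensorFlow implementation: it assumes a binary formulation with a single predicted probability for class 1 (instead of the two-column one-hot targets above), and the function name and the `eps` clamp are my own choices for illustration.

```python
import math

def weighted_binary_cross_entropy(labels, probs, alpha, eps=1e-12):
    """Mean of (L_neg + alpha * L_pos) over the examples.

    labels: 0/1 targets; probs: predicted probability of class 1;
    alpha: extra weight on the rare positive class (alpha > 1).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -alpha * math.log(p)   # positive term, scaled by alpha
        else:
            total += -math.log(1.0 - p)     # negative term, unscaled
    return total / len(labels)

labels = [0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3]
unweighted = weighted_binary_cross_entropy(labels, probs, alpha=1.0)
weighted = weighted_binary_cross_entropy(labels, probs, alpha=250.0)
```

With alpha = 250 the single badly predicted positive dominates the loss, which is the intended effect. As its name suggests, the TF function works on logits rather than probabilities, so its inputs would differ from this sketch.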

Another way, if you train with mini-batch SGD, would be to make batches with a fixed proportion of positives. I would go with the first option as it is slightly easier to do with TF.

Damn it ppwwyyxx just posted the exact same thing, will delete my answer if needs be ! — jeandut, Jul 29 '16 at 21:54
Your answer is complementary to ppzzyyxx one, since you provide the TF method name. — Hassen, May 22 '18 at 08:25

score 0 · Answer 3 · answered Jul 29 '16 at 18:11

One thing I might try is weighting the samples differently when calculating the cost. For instance, maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though, so you always get at least one of the rare samples in each batch.
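As a rough guide to how large "large" needs to be, the chance that a batch contains at least one rare sample can be estimated as 1 - (1 - p)^B. The sketch below assumes each example in a batch is drawn independently at random with positive rate p = 4/1004 (the question's ratio); the function name is hypothetical.

```python
def prob_batch_has_positive(batch_size, positive_rate):
    """Probability that a randomly drawn batch contains at least one positive."""
    return 1.0 - (1.0 - positive_rate) ** batch_size

positive_rate = 4 / 1004  # the question's 1000:4 ratio
for batch_size in (32, 128, 512, 1024):
    p = prob_batch_has_positive(batch_size, positive_rate)
    print(f"batch_size={batch_size:5d}  P(at least one positive)={p:.3f}")
```

Under that assumption, a batch of 32 contains a positive only about 12% of the time, while a batch of 1024 does so about 98% of the time.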

score 0 · Answer 4 · answered Jul 29 '16 at 22:05

Yes - a neural network could help in your case. There are at least two approaches to such a problem:

Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might help more than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts to adjust itself to improve performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k * original_percentage_of_class_k / training_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the other answers). The important thing then is to also reweight your scores in your final answer.

Marcin Możejko


Why is it that you would use a smaller batch size? I would think you would want a larger one since it will include more of the rare sample and give a more accurate representation of the gradient. – chasep255 Jul 30 '16 at 00:05
Because once your network has learned that the majority of your data belongs to one class, a batch full of 0s will not affect your network. If the batch is smaller, then if you get at least one example of a 1 in it, it will influence your network much more than in a big batch. – Marcin Możejko Jul 30 '16 at 08:59
Thanks for your answer ! What do you mean by "Important thing then is to also reweight your scores in your final answer." ? – Paul Rolin Aug 01 '16 at 16:08
It means that if the original class percentage was p1 but in your training set it was p2 (with a balanced dataset you may assume that this value is 0.5), then you need to rescale your score using the rule: score * p1 / p2. – Marcin Możejko Aug 03 '16 at 22:35
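The score * p1 / p2 rule from the comment above could be applied along these lines. This is only a sketch: the function name is hypothetical, the example scores are made up, and the final renormalisation (so the corrected scores sum to 1) is my addition rather than something stated in the thread.

```python
def correct_for_priors(scores, train_priors, original_priors):
    """Rescale each class score by original_prior / train_prior, then renormalise."""
    rescaled = [
        s * p1 / p2
        for s, p1, p2 in zip(scores, original_priors, train_priors)
    ]
    total = sum(rescaled)
    return [r / total for r in rescaled]

original_priors = [1000 / 1004, 4 / 1004]  # p1: class frequencies in the real data
train_priors = [0.5, 0.5]                  # p2: frequencies after balancing
scores = [0.2, 0.8]                        # made-up model output for one example

corrected = correct_for_priors(scores, train_priors, original_priors)
```

In this example a balanced-model score of 0.8 for class 1 drops to under 0.02 once the original 1000:4 prior is taken into account, which is why scores from a rebalanced model should not be read as probabilities on the real data without such a correction.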

score 0 · Answer 5 · answered Mar 17 '21 at 21:41

I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
