3

The training dataset contains two classes, A and B, which we represent as 1 and 0 in our target labels respectively. Our label data is heavily skewed towards class 0, which makes up roughly 95% of the data, while class 1 is only 5%. How should we construct our loss function in such a case?

I found that TensorFlow has a function that can be used with weights:

tf.losses.sigmoid_cross_entropy

weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value.

Sounds good. I set weights to 2.0 to make the loss higher and punish errors more.

loss = loss_fn(targets, cell_outputs, weights=2.0, label_smoothing=0)

However, not only did the loss not go down, it increased, and the final accuracy on the dataset decreased slightly. OK, maybe I misunderstood and it should be < 1.0, so I tried a smaller number. That didn't change anything; I got almost the same loss and accuracy. O_o

Needless to say, the same network trained on the same dataset but with a loss weight of 0.3 reduces the loss significantly, up to 10x, in Torch / PyTorch.

Can somebody please explain how to use loss weights in TensorFlow?

minerals

2 Answers

5

If you're scaling the loss with a scalar, like 2.0, then you're basically multiplying the loss and therefore the gradient for backpropagation. It's similar to increasing the learning rate, but not exactly the same, because you're also changing the ratio of the data loss to regularization losses such as weight decay.

If your classes are heavily skewed and you want to balance them when calculating the loss, then you have to specify a tensor as the weight, as described in the manual for tf.losses.sigmoid_cross_entropy():

weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding losses dimension).

That is, make the weights tensor 1.0 for class 0 and maybe 10.0 for class 1; now "false negative" losses will be counted much more heavily.
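
For concreteness, a minimal sketch of how such a weight tensor could be built from the targets themselves (TF 1.x-style graph code; the placeholders and the value 10.0 are just illustrative):

import tensorflow as tf

# Illustrative placeholders: binary labels (0 or 1) and raw model outputs.
targets = tf.placeholder(tf.float32, shape=[None, 1])
logits = tf.placeholder(tf.float32, shape=[None, 1])

# Weight 1.0 where the label is 0 and 10.0 where the label is 1,
# so errors on the minority class count ten times as much.
weights = 1.0 + 9.0 * targets

loss = tf.losses.sigmoid_cross_entropy(
    multi_class_labels=targets,
    logits=logits,
    weights=weights)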

How much you should over-weight the underrepresented class is something of an art. If you overdo it, the model will collapse and predict the over-weighted class all the time.

An alternative way to achieve the same thing is to use tf.nn.weighted_cross_entropy_with_logits(), which has a pos_weight argument for exactly this purpose. But it's in tf.nn, not tf.losses, so you have to add the result to the losses collection manually.
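
A hedged sketch of that alternative (again TF 1.x-style; the value 10.0 is illustrative):

import tensorflow as tf

targets = tf.placeholder(tf.float32, shape=[None, 1])
logits = tf.placeholder(tf.float32, shape=[None, 1])

# pos_weight multiplies the loss term of the positive (minority) class.
per_sample = tf.nn.weighted_cross_entropy_with_logits(
    targets=targets, logits=logits, pos_weight=10.0)
loss = tf.reduce_mean(per_sample)

# tf.nn ops don't register their result anywhere, so add it to the losses
# collection by hand if the rest of the code relies on tf.losses.get_total_loss().
tf.losses.add_loss(loss)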

Another general way to handle this is to oversample the underrepresented class when drawing batches; see the sketch below. That should not be overdone either. You can also combine both approaches.
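
If you go the sampling route, here is a small illustrative sketch (plain NumPy; the function name and fraction are made up) of oversampling the minority class when drawing a batch:

import numpy as np

def sample_batch_indices(labels, batch_size, minority_fraction=0.3):
    # Draw indices so that roughly `minority_fraction` of the batch is class 1,
    # regardless of its natural 5% share in the data.
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]
    n_pos = int(batch_size * minority_fraction)
    n_neg = batch_size - n_pos
    batch = np.concatenate([
        np.random.choice(pos_idx, size=n_pos, replace=True),   # oversample with replacement
        np.random.choice(neg_idx, size=n_neg, replace=False),
    ])
    np.random.shuffle(batch)
    return batch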

Peter Szoldan
  • But this basically means constructing a weight tensor per minibatch, because you have to look into the targets minibatch and see which values are in there, so as to use 10.0 where a cell is 1 and 1.0 otherwise. – minerals Apr 13 '18 at 09:41
  • Yes, basically. Added a reference to `tf.nn.weighted_cross_entropy_with_logits()`, which does it for you more easily. – Peter Szoldan Apr 13 '18 at 09:45
  • Does this mean every loss computed by tf.nn functions should be added to the losses collection? – whiletrue Apr 09 '19 at 02:05
  • Yes. Tensors returned by the tf.losses.* functions automatically become part of the losses collection. If you compute a custom tensor that you wish to use as a loss, you have to add it manually. – Peter Szoldan Apr 09 '19 at 13:53
1

You can set a penalty for misclassification of each sample. If weights is a tensor of shape [batch_size], the loss for each sample will be multiplied by the corresponding weight. So if you assign the same weight to all samples (which is the same as using a scalar weight), your loss will only be scaled by this scalar, and the accuracy should not change.

If you instead assign different weights for the minority class and the majority class, the contributions of the samples to the loss function will be different, and you should be able to influence the accuracy by choosing your weights differently.

A few scenarios (your choice will depend on what you need):

1.) If you want good overall accuracy, you could choose the weights of the majority class to be very large and the weights of the minority class much smaller. This will probably lead to all events being classified into the majority class (i.e. roughly 95% total classification accuracy), but the minority class will usually be classified into the wrong class.

2.) If your signal is the minority class and the background is the majority class, you probably want very little background contamination in your predicted signal, i.e. you want almost no background samples to be predicted as signal. This will also happen if you choose the majority weight to be much larger than the minority weight, but you might find that the network then tends to predict all samples as background, so you will not have any signal samples left. In this case you should consider a large weight for the minority class plus an extra loss for background samples being classified as signal samples (false positives), like this:

loss = weighted_cross_entropy + extra_penalty_for_false_positives
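
For concreteness, a hedged sketch of what that could look like in TF 1.x (all placeholder names and coefficients here are illustrative, not from the answer):

import tensorflow as tf

targets = tf.placeholder(tf.float32, shape=[None, 1])   # 1 = signal, 0 = background
logits = tf.placeholder(tf.float32, shape=[None, 1])
probs = tf.sigmoid(logits)

# Weighted cross entropy: up-weight the minority (signal) class.
weighted_cross_entropy = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(
        targets=targets, logits=logits, pos_weight=10.0))

# Extra penalty on the predicted signal probability of background samples,
# i.e. on false positives.
extra_penalty_for_false_positives = tf.reduce_mean((1.0 - targets) * probs)

loss = weighted_cross_entropy + 5.0 * extra_penalty_for_false_positives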

ml4294