
Assuming after performing median frequency balancing for images used for segmentation, we have these class weights:

class_weights = {0: 0.2595,
                 1: 0.1826,
                 2: 4.5640,
                 3: 0.1417,
                 4: 0.9051,
                 5: 0.3826,
                 6: 9.6446,
                 7: 1.8418,
                 8: 0.6823,
                 9: 6.2478,
                 10: 7.3614,
                 11: 0.0}
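
For context, here is a minimal sketch of how weights like these might be computed with median frequency balancing. It assumes the usual formulation: the frequency of a class is its pixel count divided by the total pixel count of the images in which it appears, and the weight is the median frequency divided by the class frequency. The function name `median_frequency_weights`, the toy label maps, and the choice to give unseen classes a weight of 0 are illustrative assumptions, not the code that produced the numbers above.

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Sketch of median frequency balancing.

    label_maps: iterable of integer arrays of shape [height, width]
    with values in [0, num_classes).
    """
    pixel_counts = np.zeros(num_classes, dtype=np.float64)  # pixels of class c
    image_pixels = np.zeros(num_classes, dtype=np.float64)  # pixels in images containing c
    for labels in label_maps:
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        pixel_counts += counts
        image_pixels[counts > 0] += labels.size

    seen = pixel_counts > 0
    freq = np.zeros(num_classes)
    freq[seen] = pixel_counts[seen] / image_pixels[seen]

    # Assumption: classes that never occur get a weight of 0.
    weights = np.zeros(num_classes)
    weights[seen] = np.median(freq[seen]) / freq[seen]
    return weights

# Hypothetical toy data: two 2x2 label maps with 3 classes.
toy_label_maps = [np.array([[0, 0], [0, 1]]), np.array([[0, 0], [2, 2]])]
toy_weights = median_frequency_weights(toy_label_maps, num_classes=3)
```

With this definition, classes rarer than the median get a weight above 1 and more common classes get a weight below 1.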

The idea is to create a weight_mask that can be multiplied by the per-pixel cross entropy output. To create this weight mask, we can broadcast the class weights based on either the ground_truth labels or the predictions. Some details of my implementation:

  1. Both labels and logits are of shape [batch_size, height, width, num_classes]

  2. The weight mask is of shape [batch_size, height, width, 1]

  3. The weight mask is broadcast across the num_classes channels of the product of the labels and the softmax of the logits, giving an output of shape [batch_size, height, width, num_classes]. In this case, num_classes is 12.

  4. Reduce sum for each example in a batch, then perform reduce mean for all examples in one batch to get a single scalar value of loss.
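
The four steps above can be sketched in NumPy as follows. This is only an illustration under some assumptions: the mask is built from the ground truth labels, the product in step 3 uses the log of the softmax (as cross entropy does), and the names `softmax`, `weighted_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (class) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def weighted_loss(logits, onehot_labels, class_weights):
    # Step 1: logits and labels are [batch_size, height, width, num_classes].
    # Step 2: weight mask of shape [batch_size, height, width, 1],
    # here taken from the ground truth class of each pixel.
    weight_mask = np.sum(onehot_labels * class_weights, axis=-1, keepdims=True)
    # Step 3: the mask broadcasts across the num_classes channels.
    per_class = -weight_mask * onehot_labels * np.log(softmax(logits))
    # Step 4: sum within each example, then average over the batch.
    per_example = per_class.sum(axis=(1, 2, 3))
    return per_example.mean()

# Toy check: uniform logits, every pixel labelled class 1.
toy_logits = np.zeros((2, 2, 2, 3))
toy_labels = np.zeros((2, 2, 2, 3))
toy_labels[..., 1] = 1.0
toy_loss = weighted_loss(toy_logits, toy_labels, np.array([0.5, 2.0, 1.0]))
```

In the toy check each pixel contributes its class weight times log(3), so the result only depends on the weight of the true class.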

In this case, should we create the weight mask based on the predictions or the ground truth?

If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way.

But if we build it based on the predictions, then for whatever logit predictions that are produced, if the predicted label (from taking the argmax of the logit) is dominant, then the logit values for that pixel will all be reduced by a significant amount.

Although the maximum logit will still be the maximum, since all of the logits in the 12 channels are scaled by the same value, the final softmax probability of the predicted label (which is the same label before and after scaling) will be lower than before scaling (I did some simple math to estimate this). As a result, a lower loss is predicted.
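
A small numeric check of this scaling argument may help. The sketch below uses made-up logits and a made-up scale of 0.2 (a weight below 1, as for a dominant class). As far as I can tell, whether the loss goes down depends on whether the predicted label is the true one, so both cases are computed.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax for a 1-D vector of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.0])  # hypothetical logits, class 0 is predicted
scale = 0.2                         # hypothetical weight of a dominant class

p_before = softmax(logits)
p_after = softmax(scale * logits)

# A positive scale keeps the argmax but flattens the distribution.
same_argmax = np.argmax(p_before) == np.argmax(p_after)

# Case A: the predicted class (0) is also the true class.
loss_correct_before = -np.log(p_before[0])
loss_correct_after = -np.log(p_after[0])

# Case B: the true class is actually class 2.
loss_wrong_before = -np.log(p_before[2])
loss_wrong_after = -np.log(p_after[2])
```

In this example the loss rises after scaling when the prediction is correct and falls when it is wrong. A weight above 1 sharpens the distribution instead, which has the opposite effect.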

But the problem is this: If a lower loss is predicted as a result of this weighting, then wouldn't it contradict the idea that predicting dominant labels should give you a greater loss?

The impression I get in total for this method is that:

  1. Dominant labels are both penalized and rewarded much less.
  2. For the less dominant labels, they are rewarded highly if the predictions are correct, but they're also penalized heavily for a wrong prediction.

So how does this help to tackle the issue of class-balancing? I don't quite get the logic here.


IMPLEMENTATION

Here is my current implementation for calculating the weighted cross entropy loss, although I'm not sure if it is correct.

import tensorflow as tf  # written against the TensorFlow 1.x API (tf.losses)

def weighted_cross_entropy(logits, onehot_labels, class_weights):
    if not logits.dtype == tf.float32:
        logits = tf.cast(logits, tf.float32)

    if not onehot_labels.dtype == tf.float32:
        onehot_labels = tf.cast(onehot_labels, tf.float32)

    #Obtain the logit label predictions and form a skeleton weight mask with the same shape as it
    logit_predictions = tf.argmax(logits, -1) 
    weight_mask = tf.zeros_like(logit_predictions, dtype=tf.float32)

    #Obtain the number of class weights to add to the weight mask
    num_classes = logits.get_shape().as_list()[3]

    #Form the weight mask mapping for each pixel prediction
    for i in range(num_classes):
        binary_mask = tf.equal(logit_predictions, i) #Get only the positions for class i predicted in the logits prediction
        binary_mask = tf.cast(binary_mask, tf.float32) #Convert boolean to ones and zeros
        class_mask = tf.multiply(binary_mask, class_weights[i]) #Multiply only the ones in the binary mask with the specific class_weight
        weight_mask = tf.add(weight_mask, class_mask) #Add to the weight mask

    #Multiply the logits with the scaling based on the weight mask then perform cross entropy
    weight_mask = tf.expand_dims(weight_mask, 3) #Expand the fourth dimension to 1 for broadcasting
    logits_scaled = tf.multiply(logits, weight_mask)

    return tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits_scaled)

Could anyone verify whether my understanding of this weighted loss is correct, and whether my implementation is correct? This is my first time working with a dataset that has imbalanced classes, so I would really appreciate it if anyone could check this.

TESTING RESULTS: After doing some tests, I found that the implementation above results in a greater loss. Is this supposed to be the case? That is, would this make the training harder but eventually produce a more accurate model?


SIMILAR THREADS

Note that I have checked a similar thread here: How can I implement a weighted cross entropy loss in tensorflow using sparse_softmax_cross_entropy_with_logits

But it seems that TF only has a sample-wise weighting for loss but not a class-wise one.

Many thanks to all of you.

kwotsin
  • "If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way." Why is that? – P-Gn Jun 09 '17 at 09:36
  • Meaning to say if a certain pixel [x,y] is supposed to be labelled 1, but the predictions can be anything from 0 to 11, then regardless of what the prediction is given for that label, the scaling for the specific pixel applied to the logits will be the same no matter what logit prediction it is. I thought this would be weird given that we want to adaptively penalize the predicted labels. Do you have some insights into this? – kwotsin Jun 09 '17 at 09:53

1 Answer


Here is my own implementation in Keras using the TensorFlow backend:

import pickle

import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    # Clip the predicted probabilities so that we never take log(0)
    output = tf.clip_by_value(output, 10e-8, 1.-10e-8)
    with open('class_weights.pickle', 'rb') as f:
        weight = pickle.load(f)
    return -tf.reduce_sum(target * weight * tf.log(output))

where weight is just a standard Python list with the indexes of the weights matched to those of the corresponding class in the one-hot vectors. I store the weights as a pickle file to avoid having to recalculate them. It is an adaptation of the Keras categorical_crossentropy loss function. The first line simply clips the value to make sure we never take the log of 0.

I am unsure why one would calculate the weights using the predictions rather than the ground truth; if you provide further explanation I can update my answer in response.

Edit: Play around with this numpy code to understand how this works. Also review the definition of cross entropy.

import numpy as np

weights = [1,2]

target = np.array([ [[0.0,1.0],[1.0,0.0]],
                    [[0.0,1.0],[1.0,0.0]]])

output = np.array([ [[0.5,0.5],[0.9,0.1]],
                    [[0.9,0.1],[0.4,0.6]]])

crossentropy_matrix = -np.sum(target * np.log(output), axis=-1)
crossentropy = -np.sum(target * np.log(output))
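
The snippet above defines a weights list but does not apply it. A minimal sketch of the weighted version, assuming the weights are simply broadcast along the last (class) axis of the one-hot target, could look like this. The names `weighted_matrix` and `weighted_crossentropy` are illustrative.

```python
import numpy as np

weights = np.array([1.0, 2.0])  # one weight per class

target = np.array([[[0.0, 1.0], [1.0, 0.0]],
                   [[0.0, 1.0], [1.0, 0.0]]])

output = np.array([[[0.5, 0.5], [0.9, 0.1]],
                   [[0.9, 0.1], [0.4, 0.6]]])

# weights has shape (2,), so it broadcasts against the last axis of
# target, which has shape (2, 2, 2). Because target is one-hot, only
# the weight of the true class survives the multiplication per pixel.
weighted_matrix = -np.sum(target * weights * np.log(output), axis=-1)
weighted_crossentropy = weighted_matrix.sum()
```

Pixels whose true class is class 1 end up with twice the unweighted cross entropy, while pixels of class 0 are unchanged. In other words, the weight each pixel receives is picked by the ground truth, not by the prediction.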
Jessica Alan
  • May I know what the shapes of your inputs are? Would this function work for the case of 4 dimensions? Actually, I'm not very sure about the weights calculation either, so I'm guessing it could be based on either the predictions or the ground truth. Do you have any further reference where I could read up to understand why it should be based on the ground truth instead? Also, do you know of any implementations of median frequency balancing? – kwotsin Jun 09 '17 at 16:57
  • It's the outputs that matter. The inputs are RGB images of shape (1024, 512, 3) and the outputs are annotations of shape (1024, 512, 1). The function should work for outputs of any rank. – Jessica Alan Jun 09 '17 at 17:30
  • I believe if you're handling a batch of images, i.e. rank 4, you should use reduce mean at the end as well. In your case, are your classes only 1 or 0? I've always thought that for multi-label pixel classification the output should have num_classes channels. I don't quite get how the broadcasting for target * weight takes place if target is rank 4 but weight is just rank 1. How will each pixel know what weight to be assigned? – kwotsin Jun 09 '17 at 18:33
  • Apologies, the output is actually a 2d matrix of one-hot vectors, shape (1024, 512, 34). There are 34 classes. You can use reduce_mean or reduce_sum; these will change the magnitude of the loss (and of its gradient) but not the direction of the gradient. Review the definition of crossentropy - if our true probability distribution is (e.g.) [0,1,0] and our prediction is [0.1, 0.7, 0.2], log(0.1) and log(0.2) get multiplied by zero and thus do not contribute to the loss. Only the prediction for the true class contributes. As for the broadcasting, I have added some sample numpy code to my answer for you to play with. – Jessica Alan Jun 09 '17 at 19:09
  • You are right. The numpy broadcasting (and indeed tf broadcasting) can be much simpler, as in your implementation. I'm wondering, should I multiply the logits by the class weights first, or should I multiply the class weights by the softmaxed logits? What is the reasoning behind applying the class weights to the softmax output instead? Also, do you know some conventional ways of calculating the class weights? I don't particularly get how median frequency balancing works, and there isn't much documentation about it online (it's a fairly recent idea). – kwotsin Jun 13 '17 at 10:26
  • I cannot say; in my own implementation I simply summed the occurrences of each class and divided the total number of pixels by that number to get the weight for that class. In practice, this caused the network to be overly sensitive to noise, and to drastically overestimate the occurrences of infrequent classes. I have abandoned my experiments with class weighting for now, but in the future I might retry it after quashing the weights through a function such as sigmoid in order to bring them closer together. – Jessica Alan Jun 13 '17 at 15:56
  • In your experience with class weighting, did you encounter a diverging loss? The method you suggested looks perfectly fine when tested on some small samples, but I have gotten diverging losses in every attempt regardless of my class weights. One very possible reason is that if we multiply the class weights against the labels, we might calculate the wrong loss, e.g. the label is 0.7 but the prediction is 0.9, which messes up the cross entropy calculation. Should we multiply the weights against the predictions instead? Again, are there any references for this method? – kwotsin Jun 20 '17 at 07:16
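
To make the cross entropy example from the comments concrete, here is a small NumPy sketch using the same numbers: true distribution [0, 1, 0] and prediction [0.1, 0.7, 0.2]. The variable names are illustrative.

```python
import numpy as np

true_dist = np.array([0.0, 1.0, 0.0])
prediction = np.array([0.1, 0.7, 0.2])

# Per-class contributions: log(0.1) and log(0.2) are multiplied by
# zero, so only the true class (index 1) contributes, giving -log(0.7).
contributions = -true_dist * np.log(prediction)

loss_sum = contributions.sum()    # what a reduce_sum would give
loss_mean = contributions.mean()  # what a reduce_mean would give
```

The sum and the mean differ only by a constant factor (here the number of classes), so switching between them rescales the loss and its gradient by that factor without changing which predictions are favoured.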