
I have highly imbalanced data in a two-class problem that I am trying to solve with a TensorFlow neural network (NN). I was able to find a posting that exactly describes the difficulty I'm having and gives a solution that appears to address my problem. However, I'm working with an assistant, and neither of us really knows Python, so TensorFlow is being used like a black box for us. I have extensive experience (decades) working in a variety of programming languages and paradigms. That experience gives me a pretty good intuitive grasp of what I see happening in the code my assistant cobbled together to get a working model, but neither of us can follow what is going on well enough to tell exactly where in TensorFlow we need to make edits to get what we want.

I'm hoping someone with a good knowledge of Python and TensorFlow can look at this and just tell us something like, "Hey, just edit the file called xxx at the lines around yyy," so we can get on with it.

Below is a link to the solution we want to implement, and I've also included the code my assistant wrote that initially got us up and running. Our code produces good results when our data is balanced, but when the data is highly imbalanced it tends to skew everything toward the larger class to get a better overall score.

Here is a link to the solution we found that looks promising:

Loss function for class imbalanced binary classifier in Tensor flow

I've included the relevant code from that link below. Since where we make these edits will depend on how we are using TensorFlow, I've also included our own implementation immediately under it in the same code block, with comments to make clear what we want to add and what we are currently doing:

# Here is the stuff we need to add some place in the TensorFlow source code:
ratio = 31.0 / (500.0 + 31.0)
class_weight = tf.constant([[ratio, 1.0 - ratio]])
logits = ... # shape [batch_size, 2]

weight_per_label = tf.transpose( tf.matmul(labels
                       , tf.transpose(class_weight)) ) #shape [1, batch_size]
# this is the weight for each datapoint, depending on its label

xent = tf.mul(weight_per_label
     , tf.nn.softmax_cross_entropy_with_logits(logits, labels, name="xent_raw")) #shape [1, batch_size]
loss = tf.reduce_mean(xent) #shape 1



# NOW HERE IS OUR OWN CODE TO SHOW HOW WE ARE USING TensorFlow:
# (Obviously this is not in the same file in real life ...)


import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
import numpy as np
from math import exp
from PreProcessData import load_and_process_training_Data, load_and_process_test_data
from PrintUtilities import printf, printResultCompare
tf.set_random_seed(0)



#==============================================================
# predefine file path

''' Unbalanced training data: the ratio of target to non-target examples is about 1:11 '''
targetFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Train1-35/tar.txt'
nontargetFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Train1-35/nontar.txt'
testFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Test41/feats41.txt'
labelFilePath = '/Volumes/Extend/BCI_TestData/60FeaturesVersion/Test41/labs41.txt'

# train_x, train_y = load_and_process_training_Data(targetFilePath, nontargetFilePath)
train_x, train_y = load_and_process_training_Data(targetFilePath, nontargetFilePath)
# test_x,test_y = load_and_process_test_data(testFilePath,labelFilePath)
test_x, test_y = load_and_process_test_data(testFilePath,labelFilePath)
# trained neural network path
save_path = "nn_saved_model/model.ckpt"

# number of classes
n_classes = 2 # in this case, target or non_target

# number of hidden layers
num_hidden_layers = 1

# number of nodes in each hidden layer
nodes_in_layer1 = 40
nodes_in_layer2 = 100
nodes_in_layer3 = 30  # we think three hidden layers is risky here; try to avoid it!

# number of training examples in each block
block_size = 3000 # the computer may not have enough memory, so we split the training data into blocks

# number of times we iterate through training data
total_iterations = 1000

# terminate training early once the computed loss drops below this value
expected_loss = 0.1

# maximum and minimum learning rates
max_learning_rate = 0.002
min_learning_rate = 0.0002


# Placeholders for values that will be fed into the graph
# tf.placeholder(dtype, shape=None, name=None)
# a tensor to hold our data features
x = tf.placeholder(tf.float32, [None,len(train_x[0])])
# placeholder for the one-hot labels: every row is [1,0] for target or [0,1] for non-target
Y_C = tf.placeholder(tf.int8, [None, n_classes])
# variable learning rate
lr = tf.placeholder(tf.float32)





# neural network model
def neural_network_model(data):
    if (num_hidden_layers == 1):
        # each layer holds weights and a bias, so the layer still produces output even if every input is zero
        # when using ReLUs, initialise biases with small *positive* values, e.g. 0.1 = tf.ones([K]) / 10
        hidden_1_layer = {'weights': tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])),
                      'bias': tf.Variable(tf.ones([nodes_in_layer1]) / 10)}

        # the output layer's bias is simply initialised to zero (no ReLU here)
        output_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer1, n_classes])),
                    'bias': tf.Variable(tf.zeros([n_classes]))}

        # multiply the input data by the layer's weights (initialised randomly, optimised during training) and add the bias
        l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['bias'])
        l1 = tf.nn.relu(l1)

        # the output logits are the last hidden activations multiplied by the output weights, plus the output bias
        Ylogits = tf.matmul(l1, output_layer['weights']) + output_layer['bias']

    if (num_hidden_layers == 2):
        # each layer holds weights and a bias, so the layer still produces output even if every input is zero
        # when using ReLUs, initialise biases with small *positive* values, e.g. 0.1 = tf.ones([K]) / 10
        hidden_1_layer = {'weights': tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])),
                      'bias': tf.Variable(tf.ones([nodes_in_layer1]) / 10)}
        hidden_2_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer1, nodes_in_layer2])),
                      'bias': tf.Variable(tf.ones([nodes_in_layer2]) / 10)}

        # the output layer's bias is simply initialised to zero (no ReLU here)
        output_layer = {'weights': tf.Variable(tf.random_normal([nodes_in_layer2, n_classes])),
                    'bias': tf.Variable(tf.zeros([n_classes]))}

        # multiply the input data by the layer's weights (initialised randomly, optimised during training) and add the bias
        l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['bias'])
        l1 = tf.nn.relu(l1)

        l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['bias'])
        l2 = tf.nn.relu(l2)

        # the output logits are the last hidden activations multiplied by the output weights, plus the output bias
        Ylogits = tf.matmul(l2, output_layer['weights']) + output_layer['bias']

    if (num_hidden_layers == 3):
        # each layer holds weights and a bias, so the layer still produces output even if every input is zero
        # when using ReLUs, initialise biases with small *positive* values, e.g. 0.1 = tf.ones([K]) / 10
        hidden_1_layer = {'weights':tf.Variable(tf.random_normal([len(train_x[0]), nodes_in_layer1])), 'bias':tf.Variable(tf.ones([nodes_in_layer1]) / 10)}
        hidden_2_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer1, nodes_in_layer2])), 'bias':tf.Variable(tf.ones([nodes_in_layer2]) / 10)}
        hidden_3_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer2, nodes_in_layer3])), 'bias':tf.Variable(tf.ones([nodes_in_layer3]) / 10)}

        # the output layer's bias is simply initialised to zero (no ReLU here)
        output_layer = {'weights':tf.Variable(tf.random_normal([nodes_in_layer3, n_classes])), 'bias':tf.Variable(tf.zeros([n_classes]))}

        # multiply the input data by the layer's weights (initialised randomly, optimised during training) and add the bias
        l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['bias'])
        l1 = tf.nn.relu(l1)

        l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['bias'])
        l2 = tf.nn.relu(l2)

        l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['bias'])
        l3 = tf.nn.relu(l3)

        # the output logits are the last hidden activations multiplied by the output weights, plus the output bias
        Ylogits = tf.matmul(l3,output_layer['weights']) + output_layer['bias']

    return Ylogits # return the logits tensor of the network

# set up the training process
def train_neural_network(x):

    # build the logits from the neural network model
    Ylogits = neural_network_model(x)

    # measure the error with the built-in cross-entropy function; this is the value we want to minimize
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C))

    # optimizer to minimize our cost (cross_entropy); the default learning rate is 0.001, but here we feed our own in through the lr placeholder
    # optimizer = tf.train.GradientDescentOptimizer(0.003)
    optimizer = tf.train.AdamOptimizer(lr)
    train_step = optimizer.minimize(cross_entropy)

    # start the session
    with tf.Session() as sess:
        # initialize all of our variables before training starts
        sess.run(tf.global_variables_initializer())
        # iterate total_iterations times (cycles of feed-forward and back-prop); each epoch means the network sees all of train_data once
        for epoch in range(total_iterations):
            # accumulate the total cost per epoch; a declining value means a better result
            epoch_loss=0
            i=0
            decay_speed = 150

            # current learning rate
            learning_rate = min_learning_rate + (max_learning_rate - min_learning_rate) * exp(-epoch/decay_speed)

            # split the dataset into blocks of block_size examples in case we run out of memory
            while i < len(train_x):
                # load train data
                start = i
                end = i + block_size
                batch_x = np.array(train_x[start:end])
                batch_y = np.array(train_y[start:end])
                train_data = {x: batch_x, Y_C: batch_y, lr: learning_rate}
                # train
                # sess.run(train_step,feed_dict=train_data)
                # run optimizer and cost against batch of data.
                _, c = sess.run([train_step, cross_entropy], feed_dict=train_data)
                epoch_loss += c
                i+=block_size

            # print iteration status
            printf("epoch: %5d/%d , loss: %f", epoch, total_iterations, epoch_loss)

            # terminate training when loss < expected_loss
            if epoch_loss < expected_loss:
                break

        # how many predictions we made that were perfect matches to their labels
        # test model
        # test data 
        test_data = {x:test_x, Y_C:test_y}
        # calculate accuracy
        correct_prediction = tf.equal(tf.argmax(Ylogits, 1), tf.argmax(Y_C, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))
        print('Accuracy:',accuracy.eval(test_data))
        # result vector: the predicted class index (position of the largest logit) for each test example
        result = (sess.run(tf.argmax(Ylogits.eval(feed_dict=test_data),1)))
        answer = []
        for i in range(len(test_y)):
            if test_y[i] == [0,1]:
                answer.append(1)
            elif test_y[i]==[1,0]:
                answer.append(0)
        answer = np.array(answer)
        printResultCompare(result,answer)

        # save the raw network outputs for the test set
        np.savetxt('nn_prediction.txt', Ylogits.eval(feed_dict={x: test_x}), delimiter=',',newline="\r\n")

        # save the nn model for later use again
        # 'Saver' op to save and restore all the variables
        saver = tf.train.Saver()
        saver.save(sess, save_path)
        #print("Model saved in file: %s" % save_path)

# load the trained neural network model
def test_loaded_neural_network(trained_NN_path):
    Ylogits = neural_network_model(x)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # load saved model
        saver.restore(sess, trained_NN_path)
        print("Loading variables from '%s'." % trained_NN_path)
        np.savetxt('nn_prediction.txt', Ylogits.eval(feed_dict={x: test_x}), delimiter=',',newline="\r\n")
        # test model

        # result matrix
        result = (sess.run(tf.argmax(Ylogits.eval(feed_dict={x:test_x}),1)))
        # answer matrix
        answer = []
        for i in range(len(test_y)):
            if test_y[i] == [0,1]:
                answer.append(1)
            elif test_y[i]==[1,0]:
                answer.append(0)
        answer = np.array(answer)

        printResultCompare(result,answer)

        # calculate accuracy
        correct_prediction = tf.equal(tf.argmax(Ylogits, 1), tf.argmax(Y_C, 1))

        print(Ylogits.eval(feed_dict={x: test_x}).shape)






train_neural_network(x)

#test_loaded_neural_network(save_path)

So, can anyone help point us to the right place to make the edits that we need to make to resolve our problem? (i.e., what is the name of the file we need to edit, and where is it located?) Thanks in advance!

-gt-

1 Answer


The answer you want:

You should add this code in your train_neural_network(x) function:

ratio            = (num of classes 1) / ((num of classes 0) + (num of classes 1))
class_weight     = tf.constant([[ratio, 1.0 - ratio]])
Ylogits          = neural_network_model(x)
weight_per_label = tf.transpose( tf.matmul(Y_C , tf.transpose(class_weight)) )  
cross_entropy    = tf.reduce_mean( tf.mul(weight_per_label, tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C) ) )
optimizer        = tf.train.AdamOptimizer(lr)
train_step       = optimizer.minimize(cross_entropy)

instead of these lines:

Ylogits = neural_network_model(x)

# measure the error with the built-in cross-entropy function; this is the value we want to minimize
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C))

# optimizer to minimize our cost (cross_entropy); the default learning rate is 0.001, but here we feed our own in through the lr placeholder
# optimizer = tf.train.GradientDescentOptimizer(0.003)
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)
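Putting it together inside your train_neural_network(x), the replacement might look roughly like this. This is only a sketch: the 11/(1+11) ratio just mirrors the roughly 1:11 target to non-target split mentioned in your code comments (substitute your real class counts), and Y_C is cast to float32 before the matmul because your code declares it as an int8 placeholder.

# illustrative class counts only -- replace with the real counts from your training data
ratio            = 11.0 / (1.0 + 11.0)
class_weight     = tf.constant([[ratio, 1.0 - ratio]])

Ylogits          = neural_network_model(x)

# Y_C is an int8 placeholder, so cast it to float32 before the matmul
labels_float     = tf.to_float(Y_C)
weight_per_label = tf.transpose( tf.matmul(labels_float, tf.transpose(class_weight)) )  # shape [1, batch_size]

# weight each example's cross-entropy by its class weight, then average down to a scalar
cross_entropy    = tf.reduce_mean( tf.mul(weight_per_label,
                       tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C)) )

optimizer        = tf.train.AdamOptimizer(lr)
train_step       = optimizer.minimize(cross_entropy)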

More Details:

In a neural network we calculate the error of the prediction with respect to the targets (the true labels); in your case you use the cross-entropy error, which sums, over the classes, the target multiplied by the log of the predicted probability.

The network's optimizer backpropagates to minimize this error and achieve better accuracy.

Without a weighted loss, the weight for each class is equal, so the optimizer reduces the error for the class that has more examples and overlooks the other class.

So in order to prevent this phenomenon, we should force the optimizer to backpropagate a larger error for the class with fewer examples; to do this, we multiply each example's error by a class-dependent scalar.
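To make the scaling concrete, here is a tiny NumPy illustration (the class ratio is the toy 31/(500+31) one from the snippet in your question, and the probabilities are made up, not taken from your data). The per-example weight is picked out of class_weight by the one-hot label and rescales that example's cross-entropy before the mean is taken:

import numpy as np

ratio = 31.0 / (500.0 + 31.0)                     # ~0.058, as in the snippet above
class_weight = np.array([[ratio, 1.0 - ratio]])   # shape [1, 2]

# two one-hot labelled examples, one from each class
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
# made-up softmax probabilities the network might output for them
probs  = np.array([[0.3, 0.7],
                   [0.1, 0.9]])

# plain cross-entropy per example: -sum(label * log(predicted probability))
xent_raw = -np.sum(labels * np.log(probs), axis=1)       # ~[1.20, 0.11]

# per-example weight, selected from class_weight by the one-hot label
weight_per_label = labels.dot(class_weight.T).ravel()    # ~[0.058, 0.942]

# weighted mean loss: the example whose class carries the larger weight dominates
loss = np.mean(weight_per_label * xent_raw)
print(xent_raw, weight_per_label, loss)                  # loss ~0.085

Without the weights, the first example's large raw error would dominate the mean; with them, the loss is driven by whichever class you chose to give the larger weight.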

I hope it was useful :)

– Ali Abbasi
  • Thanks for the help - I really appreciate it. I've been trying to get this to work, but it's the weekend and my assistant is at home with the code on his laptop, so we're emailing back and forth and it's slow. We will meet tomorrow some time to get caught up. However, in the meantime, we got as far as trying to work on a problem with the code: weight_per_label = tf.transpose( tf.matmul(Y_C , tf.transpose(class_weight)) ) – George Townsend Jun 04 '17 at 01:47
  • tf.matmul(Y_C , tf.transpose(class_weight)) ) has my labels Y_C as int8 and the class_weights vector as float32 so matmul won't work. I thought if I replaced Y_C with tf.to_float(Y_C) that this would fix it, but it seemed to cause a new problem later in a different part of the code. Was I wrong to do the tf.to_float()??? – George Townsend Jun 04 '17 at 01:58
  • It still didn't work, but we realized that the epoch_loss in our code changed from a scalar to a vector (giving problems) and realized we needed to make one further change (IN CAPS) which I'm hoping you can comment on: cross_entropy = TF.REDUCE_MEAN(tf.mul(weight_per_label, tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_C))) so that we actually get a scalar out of the deal. Is that a reasonable addition? The code finally seems to run and produce reasonable results, but I'm not certain if we are on good ground, so I thought I'd ask ... – George Townsend Jun 04 '17 at 14:37
  • Yeah, you're right, I forgot to add `tf.reduce_mean`; I'll edit my answer so someone else can use it. In my opinion you are on good ground; feel free to post your results here (mostly the loss curve and accuracy curve), then we can see what happens in your model. – Ali Abbasi Jun 04 '17 at 17:03
  • Okay, thanks a lot for the help. My assistant and I were at a standstill until you helped, so we both really appreciate you taking the time to sort this out. We were on a wild goose chase trying to find the appropriate TensorFlow source code to edit when it was really our own code that we needed to focus on. So you saved us a lot of time and trouble! Thanks again. – George Townsend Jun 05 '17 at 04:03
  • You're welcome, I'm glad to hear that. Now you can accept my answer, so others can refer to it if they need. :) – Ali Abbasi Jun 05 '17 at 16:38
  • I'd be happy to do that, if you could tell me how. I clicked on the upvote arrow for your answer. A message popped up that said my vote would be recorded but not made public. Is there something else I need to click on to indicate that the answer was good? – George Townsend Jun 06 '17 at 04:02