
Contextualization
I am building a neural network for multi-label classification: identifying the labels present in an image (what clothes a person is wearing, their color, etc.). I wanted to use pure TensorFlow (instead of APIs like Keras) in order to have more flexibility over my metrics.
P.S.: The data used for this TensorFlow model was tested with a Keras-built model and did not produce the issues that I am going to describe here.

Data
My input data are (X, Y): X is of shape (1814, 204, 204, 3) and Y is of shape (1814, 39). So basically X is the set of images and Y contains the labels associated with each image, which will be used for the supervised learning process.
There are 39 labels in total, so for every image of size (1, 204, 204, 3) we associate a vector of shape (1, 39): each of the 39 values is 0 or 1, namely 1 if the corresponding label is identified in that image and 0 otherwise. Several labels can be identified at the same time, which means we are not using one-hot encoding and this is not a multi-class classification situation!
P.S.: I have already normalized my data so that it lies in [0, 1].
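To make the label encoding concrete, here is a minimal sketch of what one such target vector looks like (the label indices used here are made up purely for illustration):

import numpy as np

num_labels = 39
present = [2, 17, 30]              # hypothetical indices of the labels found in one image

y = np.zeros(num_labels, dtype=np.float32)
y[present] = 1.0                   # multi-hot: several entries can be 1 at once

print(y.shape)                     # (39,)
print(int(y.sum()))                # 3 -> more than one active label, so not one-hot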

What I have done
1. The first thing I did was build the abstract version of my classifier (which is a CNN). Here is the structure of my CNN:

# Convolutional Layer 1
# Dropout layer 1
# Convolutional Layer 2
# Pooling Layer 2
# Dense layer 3  
# Dropout layer 3   
# Dense layer 4

For a given dataset of size (?, 204, 204, 3), here is the flow of the data through the different layers:

conv1 OUTPUT shape:  (?, 204, 204, 32)  
drop1 OUTPUT shape:  (?, 204, 204, 32)  
conv2 OUTPUT shape:  (?, 204, 204, 32)  
pool2 OUTPUT shape:  (?, 102, 102, 32)  
dense3 OUTPUT shape:  (?, 512)  
drop3 OUTPUT shape:  (?, 512)  
dense4 OUTPUT shape:  (?, 39)  

Here is the code for building the structure of the CNN:

def create_model(X,Y):
        # Convolutional Layer #1
        conv1 = tf.layers.conv2d(
          inputs=X,
          filters=32,
          kernel_size=[3, 3],
          padding="same",
          activation=tf.nn.relu)
        print('conv1 OUTPUT shape: ',conv1.shape)

        # Dropout layer #1
        dropout1 = tf.layers.dropout(
          inputs=conv1, rate=0.2, training='TRAIN' == tf.estimator.ModeKeys.TRAIN)
        print('drop1 OUTPUT shape: ',dropout1.shape)

        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
          inputs=dropout1,
          filters=32,
          kernel_size=[3, 3],
          padding="same",
          activation=tf.nn.relu)
        print('conv2 OUTPUT shape: ',conv2.shape)

        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2],strides=2)
        print('pool2 OUTPUT shape: ',pool2.shape)
        pool2_flat = tf.reshape(pool2, [-1, pool2.shape[1]*pool2.shape[2]*pool2.shape[3]])

        # Dense layer #3
        dense3 = tf.layers.dense(inputs=pool2_flat, units=512, activation=tf.nn.relu)
        print('dense3 OUTPUT shape: ',dense3.shape)

        # Dropout layer #3
        dropout3 = tf.layers.dropout(
          inputs=dense3, rate=0.5, training='TRAIN' == tf.estimator.ModeKeys.TRAIN)
        print('drop3 OUTPUT shape: ',dropout3.shape)

        # Dense layer #4
        Z = tf.layers.dense(inputs=dropout3, units=39, activation=tf.nn.sigmoid)
        print('dense4 OUTPUT shape: ',Z.shape)

        return Z  

2. Now I define my cost function and my optimizer.

  • For the cost function I am using cross-entropy (computed manually on the sigmoid outputs) and I calculate, independently for each output component, the mean over the batch sample. For example, if I have a batch of size 10, the output of the model is of shape (10, 39), so for the cost we get a vector of shape (1, 39) (for each label we calculate the mean over the different examples in the batch).
  • For the optimizer I am using the Adam optimizer.

Here is the code for calculating the cost and the optimizer:

def optimizer_and_cost(output,labels):
    # Calculating cost
    cost= tf.reduce_mean(labels * - tf.log(output) + (1 - labels) * - tf.log(1 - output),axis=0)
    print('cost: shape of cost: ',cost.shape)
    cost= tf.reshape(cost, [1, 39])
    print('cost reshaped: shape of cost reshaped: ',cost.shape)

    #Optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    return optimizer,cost  

P.S.: The axis=0 in tf.reduce_mean is what allows me to calculate, for each label independently, the mean over the batch examples!
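To make the axis=0 behaviour concrete, here is a tiny NumPy sketch with arbitrary random numbers:

import numpy as np

# Hypothetical per-example, per-label losses for a batch of 10 examples and 39 labels.
per_example_loss = np.random.rand(10, 39)

# axis=0 averages over the batch dimension, leaving one mean loss per label.
per_label_mean = per_example_loss.mean(axis=0)
print(per_label_mean.shape)                 # (39,)

# The default axis=None would instead collapse everything to a single scalar.
print(np.shape(per_example_loss.mean()))    # ()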

3. Defining placeholders, initializing the model, and training.
Once my abstract model with its different parameters was defined, I created placeholders and built the computational graph, then I initialized the weights and started the training. Issue: I started getting NaN values for the weights in the different layers and NaNs in the cost function as the optimization went on. So my first reflex was to try to debug and understand what happens.
I tried to test a simple case, which is as follows:
initialize weights ---> calculate cost and print it (print weights too) ---> do one optimization step ---> calculate cost and print it (print weights too).
Result:
The first print is fine, I get real values (as expected). However, after the first optimization step I get NaN values for the cost. Why does my optimizer make the cost go NaN after one optimization step?!
Here is the code for the test (X_train and Y_train are of shape (1269, 204, 204, 3) and (1269, 39); I am taking only 4 elements of each for the test):

import tensorflow as tf
from tensorflow.python.framework import ops

#clearing the graph
ops.reset_default_graph()

#defining placeholders
X = tf.placeholder(tf.float32, [None, X_train.shape[1],X_train.shape[2],X_train.shape[3]])
Y = tf.placeholder(tf.float32, [None, Y_train.shape[1]])
optimizer, cost=optimizer_and_cost(create_model(X,Y),Y)

# Initialize all the variables globally
init = tf.global_variables_initializer()

# Start the session to compute the tensorflow graph
sess=tf.Session()
sess.run(init)

#printing cost and first layers weights
print('first layer weights ',sess.run(tf.trainable_variables()[0]) )
print('cost: ',sess.run(cost,feed_dict={X:X_train[0:4,:], Y:Y_train[0:4,:]}))

#doing one optimization step
_ ,OK=sess.run([optimizer, cost], feed_dict={X:X_train[0:4,:], Y:Y_train[0:4,:]})


#printing cost and first layers weights
print('first layer weights ',sess.run(tf.trainable_variables()[0]) )
print('cost :',sess.run(cost,feed_dict={X:X_train[0:4,:], Y:Y_train[0:4,:]}))

#closing session
sess.close()
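As a side note, one possible way to localize which tensor first becomes NaN (a sketch, assuming TensorFlow 1.x) is tf.add_check_numerics_ops, which attaches a check to every floating-point tensor in the graph:

# Build the graph exactly as above, then add the numeric checks before running.
check_op = tf.add_check_numerics_ops()   # fails fast on the first NaN/Inf tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # If any op produces NaN or Inf, this raises an InvalidArgumentError naming
    # the offending tensor instead of silently propagating NaNs.
    sess.run([optimizer, cost, check_op],
             feed_dict={X: X_train[0:4, :], Y: Y_train[0:4, :]})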

Any help is welcome.

mouni93
  • When creating the CNN, how did you decide between the layers you wanted (Convolutional, Pooling, Dense, or Drop)? Why do you use one vs. another? I also noticed when you were printing the shape of drop1 you printed conv1.shape. There also doesn't seem to be an input layer before your conv1. – bmc May 04 '18 at 13:20
  • 1. You are right, I printed conv1 instead of drop1; I have corrected it. 2. I am using the well-known way of constructing a CNN, which is starting with a convolutional layer, then adding pooling to reduce the shape of the outputs, and adding dropout to avoid overfitting issues. 3. I think there is no need for an input layer before conv1, since in the arguments of conv1 you specify the input, which is X. – mouni93 May 04 '18 at 15:47

2 Answers


This link solved my issue: Tensorflow NaN bug?

Basically, when calculating y * log(y), it can happen that the expression becomes 0 * log(0): log(0) is -inf and 0 * -inf is NaN, which then propagates through the whole cost. The solution is in the link provided. Thanks anyway guys for the help.
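A tiny NumPy sketch of why this produces NaN (the values are purely illustrative):

import numpy as np

output = np.array([0.0, 1.0, 0.5], dtype=np.float32)   # sigmoid outputs hitting exactly 0 and 1
labels = np.array([0.0, 1.0, 1.0], dtype=np.float32)

# log(0) is -inf, and 0 * -inf is NaN, which then poisons any mean computed over it.
loss = labels * -np.log(output) + (1 - labels) * -np.log(1 - output)
print(loss)   # [nan nan 0.6931472]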

Replacing

  cost= tf.reduce_mean(labels * - tf.log(output) + (1 - labels) * - tf.log(1 - output),axis=0)

With this

cost= tf.reduce_mean(labels * - tf.log(tf.clip_by_value(output,1e-10,1.0)) + (1 - labels) * - tf.log(tf.clip_by_value(1 - output,1e-10,1.0)),axis=0)
mouni93

It is advisable to use tf.nn.softmax_cross_entropy_with_logits_v2 instead of implementing it yourself, since it covers a lot of the corner cases that usually lead to NaN losses.

The way you use tf.nn.softmax_cross_entropy_with_logits_v2 is by making the activation of Dense Layer 4 linear instead of sigmoid. This way the outputs of your Dense Layer 4 will be logits, which can then be fed directly into tf.nn.softmax_cross_entropy_with_logits_v2.
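A minimal sketch of that pattern, reusing dropout3 and Y from the question's code (as the edit below notes, softmax assumes exactly one true label per example, so this only illustrates the mechanics):

# Dense layer 4 with no activation, so its outputs are raw logits rather than probabilities.
logits = tf.layers.dense(inputs=dropout3, units=39, activation=None)

# The op applies the softmax internally in a numerically stable way.
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=logits))

optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)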

Finally, make sure you carefully read the following:

EDIT: My bad, I didn't read the question carefully enough so I missed the fact that you said that it is not a multi-class classification problem. If it is not a multi-class classification problem then it may be beyond my current expertise. So I will leave you with another link that you and I both can carefully read.

nitred