
To avoid overfitting, I tried using dropout in the fully connected layer of a CNN trained on the CIFAR-10 dataset, and I get a strange result: the loss drops fairly quickly, but the test accuracy does not improve at all. What is wrong? Any help is highly appreciated! See the printout below:

Generation # 5. Train loss: 543.70. Train acc (test acc): 14.00 (11.50)
Generation # 10. Train loss: 390.62. Train acc (test acc): 7.50 (11.50)
Generation # 15. Train loss: 286.08. Train acc (test acc): 13.50 (10.50)
Generation # 20. Train loss: 211.68. Train acc (test acc): 12.00 (11.00)
Generation # 25. Train loss: 180.75. Train acc (test acc): 7.50 (11.00)
Generation # 30. Train loss: 140.63. Train acc (test acc): 14.50 (17.00)
Generation # 35. Train loss: 123.40. Train acc (test acc): 17.00 (15.50)
Generation # 40. Train loss: 107.11. Train acc (test acc): 13.00 (11.50)
Generation # 45. Train loss: 96.01. Train acc (test acc): 16.50 (12.50)
Generation # 50. Train loss: 68.94. Train acc (test acc): 18.50 (15.00)
Generation # 55. Train loss: 65.62. Train acc (test acc): 12.00 (17.00)
Generation # 60. Train loss: 47.64. Train acc (test acc): 19.00 (18.00)
Generation # 65. Train loss: 33.38. Train acc (test acc): 21.00 (15.50)
Generation # 70. Train loss: 29.28. Train acc (test acc): 17.00 (14.00)
Generation # 75. Train loss: 22.45. Train acc (test acc): 13.00 (18.00)
Generation # 80. Train loss: 17.00. Train acc (test acc): 11.50 (14.00)
Generation # 85. Train loss: 10.91. Train acc (test acc): 10.50 (10.50)
Generation # 90. Train loss: 8.18. Train acc (test acc): 12.00 (9.50)
Generation # 95. Train loss: 7.07. Train acc (test acc): 10.50 (10.00)
Generation # 100. Train loss: 5.05. Train acc (test acc): 14.00 (15.50)
Generation # 105. Train loss: 3.97. Train acc (test acc): 14.00 (16.00)
Generation # 110. Train loss: 3.90. Train acc (test acc): 10.50 (4.50)
Generation # 115. Train loss: 3.83. Train acc (test acc): 11.50 (11.00)
Generation # 120. Train loss: 4.25. Train acc (test acc): 8.50 (10.50)
Generation # 125. Train loss: 3.28. Train acc (test acc): 6.50 (12.50)
Generation # 130. Train loss: 3.59. Train acc (test acc): 13.00 (8.00)

The full code of the CNN is below:

import numpy as np
import tensorflow as tf

# train_x, train_labels, test_x, test_labels are assumed to be loaded beforehand
sess = tf.Session()

batch_size = 200
learning_rate = 0.0001
evaluation_size = 200
image_width = train_x[0].shape[0]
image_height = train_x[0].shape[1]
target_size = max(train_labels) + 1
num_channels = 3
generations = 20000
eval_every = 5
conv1_features = 32
conv2_features = 32
conv3_features = 64
max_pool_size1 = 2
max_pool_size2 = 2
max_pool_size3 = 2
fully_connected_size1 = 100
dropout_rate = 0.5

keep_prob = tf.placeholder(tf.float32)
x_input_shape = (batch_size, image_width, image_height, num_channels)
x_input = tf.placeholder(tf.float32, shape=x_input_shape)
y_target = tf.placeholder(tf.int32, shape=(batch_size))
eval_input_shape = (evaluation_size, image_width, image_height, num_channels)
eval_input = tf.placeholder(tf.float32, shape=eval_input_shape)
eval_target = tf.placeholder(tf.int32, shape=(evaluation_size))

conv1_weight = tf.Variable(tf.truncated_normal([5,5,num_channels,conv1_features], stddev=0.1, dtype=tf.float32))
conv1_bias = tf.Variable(tf.zeros([conv1_features], dtype=tf.float32))
conv2_weight = tf.Variable(tf.truncated_normal([5,5,conv1_features,conv2_features], stddev=0.1, dtype=tf.float32))
conv2_bias = tf.Variable(tf.zeros([conv2_features], dtype=tf.float32))
conv3_weight = tf.Variable(tf.truncated_normal([5,5,conv2_features,conv3_features], stddev=0.1, dtype=tf.float32))
conv3_bias = tf.Variable(tf.zeros([conv3_features], dtype=tf.float32))

resulting_width = image_width // (max_pool_size1 * max_pool_size2 * max_pool_size3)
resulting_height = image_height // (max_pool_size1 * max_pool_size2 * max_pool_size3)
full1_input_size = resulting_width * resulting_height * conv3_features
full1_weight = tf.Variable(tf.truncated_normal([full1_input_size,fully_connected_size1], stddev=0.1, dtype=tf.float32))
full1_bias = tf.Variable(tf.truncated_normal([fully_connected_size1], stddev=0.1, dtype=tf.float32))
full2_weight = tf.Variable(tf.truncated_normal([fully_connected_size1, target_size], stddev=0.1, dtype=tf.float32))
full2_bias = tf.Variable(tf.truncated_normal([target_size], stddev=0.1, dtype=tf.float32))

# define net
def my_conv_net(input_data):
    # 1st conv relu maxpool layer
    conv1 = tf.nn.conv2d(input_data, conv1_weight, strides=[1,1,1,1], padding='SAME')
    relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv1_bias))
    max_pool1 = tf.nn.max_pool(relu1, ksize=[1,max_pool_size1,max_pool_size1,1], 
            strides=[1, max_pool_size1, max_pool_size1, 1], padding='SAME')
    # 2nd conv relu maxpool
    conv2 = tf.nn.conv2d(max_pool1, conv2_weight, strides=[1,1,1,1], padding='SAME')
    relu2 = tf.nn.relu(tf.nn.bias_add(conv2, conv2_bias))
    max_pool2 = tf.nn.max_pool(relu2, ksize=[1,max_pool_size2,max_pool_size2,1], 
            strides=[1, max_pool_size2, max_pool_size2, 1], padding='SAME')
    # 3rd conv relu maxpool
    conv3 = tf.nn.conv2d(max_pool2, conv3_weight, strides=[1,1,1,1], padding='SAME')
    relu3 = tf.nn.relu(tf.nn.bias_add(conv3, conv3_bias))
    max_pool3 = tf.nn.max_pool(relu3, ksize=[1,max_pool_size3,max_pool_size3,1], 
            strides=[1, max_pool_size3, max_pool_size3, 1], padding='SAME')
    # Transform output into 1xN layer for next fully connected layer
    final_conv_shape = max_pool3.get_shape().as_list()      # [batch_size/num of image, height, width, channel]
    final_shape = final_conv_shape[1] * final_conv_shape[2] * final_conv_shape[3]
    flat_output = tf.reshape(max_pool3, [final_conv_shape[0],final_shape])
    # 1st fully connected layer
    fully_connected1 = tf.nn.relu(tf.add(tf.matmul(flat_output, full1_weight), full1_bias))
    fully_connected1_dropout = tf.nn.dropout(fully_connected1, keep_prob)
    # 2nd fully connected layer
    final_model_output = tf.add(tf.matmul(fully_connected1_dropout, full2_weight), full2_bias)

    return final_model_output

# model output
model_output = my_conv_net(x_input)
test_model_output = my_conv_net(eval_input)    

# loss: sparse softmax cross entropy averaged over the batch (labels are ints, not one-hot)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=model_output, labels=y_target))

prediction = tf.nn.softmax(model_output)
test_prediction = tf.nn.softmax(test_model_output)

train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
# create accuracy function
def get_accuracy(logits, targets):
    batch_predictions = np.argmax(logits, axis=1)
    num_correct = np.sum(np.equal(batch_predictions, targets))
    return 100. * num_correct / batch_predictions.shape[0]

# initialize variables
init = tf.global_variables_initializer()
sess.run(init)

train_loss = []
train_acc = []
test_acc = []
for i in range(generations):
    rand_index = np.random.choice(len(train_x), size=batch_size, replace=False)
    rand_x = train_x[rand_index] 
    rand_y = train_labels[rand_index]
    train_dict = {x_input: rand_x, y_target: rand_y, keep_prob: dropout_rate}
    sess.run(train_step, feed_dict=train_dict)
    temp_train_loss, temp_train_preds = sess.run([loss, prediction], feed_dict={x_input: rand_x, y_target: rand_y, keep_prob: 1})
    temp_train_acc = get_accuracy(temp_train_preds, rand_y)
    if (i+1) % eval_every == 0:
        eval_index = np.random.choice(len(test_x), size=evaluation_size)
        eval_x = test_x[eval_index]
        eval_y = test_labels[eval_index]
        test_dict = {eval_input: eval_x, eval_target: eval_y, keep_prob: 1}
        test_preds = sess.run(test_prediction, feed_dict=test_dict)
        temp_test_acc = get_accuracy(test_preds, eval_y)
        # record and print results
        train_loss.append(temp_train_loss)
        train_acc.append(temp_train_acc)
        test_acc.append(temp_test_acc)
        acc_and_loss = [(i+1), temp_train_loss, temp_train_acc, temp_test_acc]
        acc_and_loss = [np.round(x,2) for x in acc_and_loss]
        print('Generation # {}. Train loss: {:.2f}. Train acc (test acc): {:.2f} ({:.2f})'.format(*acc_and_loss))        
W.S.
  • The cross-entropy loss of a random guess over 10 classes is -ln(0.1) ≈ 2.30 (see the quick check after these comments), so a train loss of 3.59 is not good at all. You can try to continue training (maybe with a smaller learning rate). Dropout is not the problem here, but in general we do not apply dropout when evaluating. – LI Xuhong Apr 20 '17 at 13:13
  • I know the loss is not great. Before I introduced dropout, the same loss already gave me around 20-30% accuracy, and it takes time to get the loss down that far. When I introduce dropout, the accuracy basically does not move... – W.S. Apr 20 '17 at 13:18
  • I have no idea... Maybe the dropout_rate is too strong (too much regularization), or maybe it is because the layer before dropout has only 100 units, and after dropout 50 units cannot provide much useful information for the optimization. But those are just wild guesses. You can also try playing with the dropout rate and the learning rate; maybe it will work better. – LI Xuhong Apr 20 '17 at 13:48
  • I tried that; it did not seem to help. Thanks for helping. – W.S. Apr 21 '17 at 13:21
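
As a quick check of the random-guess baseline mentioned in the first comment (assuming 10 equally likely CIFAR-10 classes):

import numpy as np

# cross-entropy of a uniform prediction over 10 classes: -ln(1/10)
print(-np.log(0.1))   # ≈ 2.3026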

1 Answer


Your model is definitely not learning, because even training accuracy is not improving. I couldn't spot obvious bugs in your code, so it looks like it's time to tune hyperparameters. My suggestions:

  • Try different learning rates: 0.01, 0.001. It's quite possible that 0.0001 is too small (it doesn't look like it's too large, but who knows). A rough sketch of these tweaks is shown after this list.
  • Try different initialization stddevs: 0.001, 0.0001. I have seen cases where this parameter was the only reason a network did not learn.
  • Biases can be initialized with just zero.
  • Start with a larger keep probability. A small keep_prob is good for fighting overfitting, but at this stage your model is not even remotely overfitting. You may as well set it to 1 until you have at least something working.
  • I'd also start with a smaller filter size, e.g. 3x3, especially when you don't have that many filters (32 and 64 in your case).
  • I'm not sure the second FC layer is helping at all. Consider removing it until you see the network learning.
  • By the way, even after troubleshooting, it's often beneficial to tune the hyperparameters to find the optimal ones. This will bring you some bonus accuracy, though it's not critical at this point.
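
A rough, minimal sketch of what those tweaks might look like, reusing the variable names from the question; the exact values are only starting points to tune, not known-good settings:

import tensorflow as tf

num_channels = 3          # RGB, as in the question
conv1_features = 32

learning_rate = 0.001     # try 0.01 / 0.001 instead of 0.0001
init_stddev = 0.001       # try a smaller initialization stddev: 0.001 or 0.0001
dropout_rate = 1.0        # keep_prob fed as 1.0, i.e. dropout disabled until the net learns

# smaller 3x3 filters, smaller weight stddev, zero-initialized bias
conv1_weight = tf.Variable(
    tf.truncated_normal([3, 3, num_channels, conv1_features],
                        stddev=init_stddev, dtype=tf.float32))
conv1_bias = tf.Variable(tf.zeros([conv1_features], dtype=tf.float32))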

If none of this helps, I recommend visualizing the distributions of the activations in each layer, and the distributions of the gradients or weights, in TensorBoard to narrow down the problem.
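
For reference, a minimal sketch of how such summaries might be added with the TF1 summary API, assuming it is dropped into the training script from the question; relu1 and full1_weight stand for whichever tensors you want to inspect (you would need to expose intermediate activations, e.g. by returning them from my_conv_net):

# at graph-construction time
tf.summary.histogram('relu1_activations', relu1)
tf.summary.histogram('full1_weights', full1_weight)
merged_summaries = tf.summary.merge_all()
writer = tf.summary.FileWriter('./logs', sess.graph)

# inside the training loop, next to train_step
summary_val = sess.run(merged_summaries, feed_dict=train_dict)
writer.add_summary(summary_val, i)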

Maxim
  • The OP's loss is huge, so initialization is the problem. All the remaining suggestions make sense of course, but none of these would cause such a huge initial loss. – lejlot Oct 04 '17 at 21:22