
My question is at the end.

An example CNN is trained with mini-batch GD and uses dropout in the last fully-connected layer (line 60) as

fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)

At first I thought tf.layers.dropout (or tf.nn.dropout) randomly sets neurons to zero by columns, but I recently found that's not the case. The piece of code below shows what dropout actually does. I used fc0 as a 4-sample x 10-feature matrix, and fc as its dropped-out version.

import tensorflow as tf
import numpy as np

fc0 = tf.random_normal([4, 10])  # 4 samples x 10 features
# note: the second argument of tf.nn.dropout is keep_prob (not rate),
# so each element is kept with probability 0.5 and scaled by 1/0.5
fc = tf.nn.dropout(fc0, 0.5)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

a, b = sess.run([fc0, fc])
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")

And in the output oo.txt (original matrix: lines 1-4, dropped-out matrix: lines 5-8; note that the kept values are scaled up by 1/keep_prob = 2):

0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35

My understanding of the "proper" dropout was: knock out the same p% of units for every sample in a mini-batch (or batch) gradient descent phase, and let back-propagation update the weights and biases of that "thinned network". However, in the implementation of the example, the neurons of each sample in a batch are dropped out independently, as illustrated in lines 5 to 8 of oo.txt, so the "thinned network" is different for each sample.

As a comparison, in the stochastic gradient descent case, samples are fed into the neural network one by one, and in each iteration the weights of the "thinned network" introduced by tf.layers.dropout are updated.

My question is: in mini-batch or batch training, shouldn't the implementation knock out the same neurons for all samples in one batch, perhaps by applying a single mask to all input samples at each iteration? Something like:

# ones: a 1 x N all-ones tensor
# mask: a 1 x N tensor of zeros and (up-scaled) ones; multiplying fc1 by mask
#       broadcasts the same mask along the sample axis
ones = tf.ones([1, num_neurons])
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)

Now I'm thinking the dropout strategy in the example may effectively be a weighted way of updating the weights of a given neuron: if a neuron is kept in only 1 out of 10 samples in a mini-batch, its weights will be updated by alpha * 1/10 * (y_k_hat - y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat - y_k) * x_k] for the weights of another neuron kept in all 10 samples?
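A toy NumPy sketch of that interpretation (all numbers hypothetical, and the 1/keep_prob rescaling is ignored for simplicity; the per-sample mask simply zeroes a sample's contribution to the averaged gradient of one linear neuron):

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(10, 5)          # 10 samples x 5 inputs to one linear neuron (hypothetical)
err = rng.randn(10, 1)        # per-sample errors (y_k_hat - y_k), hypothetical
keep = rng.rand(10, 5) > 0.5  # an independent dropout mask per sample

# Mini-batch average gradient for the neuron's weights: each sample k
# contributes (y_k_hat - y_k) * x_k, but only in the positions its own
# mask kept; a unit kept in just 1 of the 10 samples gets 1/10 of a
# single sample's term, exactly as in the formula above.
grad = np.mean(err * x * keep, axis=0)
print(grad)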


– kbxu

2 Answers


Dropout is commonly used to prevent overfitting, which in this case would show up as a huge weight applied to one of the neurons. By randomly zeroing that neuron from time to time, you force the network to use more neurons in determining the outcome. For this to work well you should drop different neurons for each example, so that the gradient you compute is more similar to the one you would get without dropout.

If you were to drop the same neurons for each example in the batch, my guess is that you would get a less stable gradient (this might not matter for your application).

In addition, dropout up-scales the remaining values to keep the average activation at about the same level. Without this, the network would learn wrong biases or would over-saturate when you turn dropout off.
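You can check the up-scaling directly (a quick sketch; in TF 1.x the second argument of tf.nn.dropout is the keep probability, so survivors are multiplied by 1/keep_prob):

import tensorflow as tf

x = tf.ones([1000, 100])
# keep_prob = 0.5: roughly half the entries become 0, the rest become 2
y = tf.nn.dropout(x, 0.5)

with tf.Session() as sess:
    print(sess.run(tf.reduce_mean(y)))  # close to 1.0, the mean of x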

If you still want the same neurons to be dropped for the whole batch, then apply dropout to an all-ones tensor of shape (1, num_neurons) and multiply it with the activations, as sketched below.
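A minimal sketch of that suggestion (assuming TF 1.x, with num_neurons = 10 and rate = 0.5 picked for illustration; the 1 x 10 mask broadcasts along the sample axis, so every row gets zeros in the same columns):

import tensorflow as tf

activations = tf.random_normal([4, 10])   # 4 samples x 10 neurons
ones = tf.ones([1, 10])                   # all-ones tensor, shape (1, num_neurons)
mask = tf.layers.dropout(ones, rate=0.5, training=True)
dropped = tf.multiply(activations, mask)  # one shared mask for the whole batch

with tf.Session() as sess:
    print(sess.run(dropped))              # same columns zeroed in every row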

– Sorin
  • The random sampling explanation helps a lot in understanding the **big-batch** gradient descent case. But the **mini-batch** GD case, I think, has too few samples (several to dozens) to simulate GD on the intact network. I guess it should knock out the same neurons and update their weights in each iteration, which leaves the random sampling to each epoch (the bigger loop) rather than each mini-batch? – kbxu Feb 05 '18 at 09:46
  • Like I said, it might work just fine for your case, while in others it may introduce more instability. – Sorin Feb 05 '18 at 12:23

When using dropout, you are effectively trying to estimate the average performance of the network over randomly chosen dropout masks, using Monte Carlo sampling (by differentiation under the integral sign, the average gradient is equal to the gradient of the average). By fixing one dropout mask per mini-batch, you are just introducing correlation between the per-sample gradient estimates within the batch, which increases the variance of the batch estimate and leads to slower training.

Imagine forming the mini-batch from k copies of the same image: with a different dropout mask for each copy you still get k distinct Monte Carlo samples, whereas with one shared mask you would compute the same gradient k times; it's obvious that the shared mask would be a complete waste of effort!
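A toy simulation of this point (all numbers hypothetical; it estimates the mean masked activation over one mini-batch and compares the variance of that estimate for per-sample masks versus one shared mask):

import numpy as np

rng = np.random.RandomState(0)
w = rng.randn(100)   # hypothetical activations of 100 units
k = 32               # mini-batch size
trials = 2000

def batch_estimate(shared):
    # One mini-batch Monte Carlo estimate of the average masked activation.
    if shared:
        masks = np.repeat(rng.rand(1, 100) > 0.5, k, axis=0)  # one mask per batch
    else:
        masks = rng.rand(k, 100) > 0.5                        # one mask per sample
    return (masks * w).sum(axis=1).mean()

print("per-sample masks:", np.var([batch_estimate(False) for _ in range(trials)]))
print("shared mask:     ", np.var([batch_estimate(True) for _ in range(trials)]))
# the per-sample variance comes out roughly k times smaller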

– user1502040