The Keras implementation of dropout references this paper.
The following excerpt is from that paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
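For concreteness, here is a minimal NumPy sketch of the scheme that excerpt describes; the array shapes and the retain probability p are my own toy choices, not anything taken from Keras or the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.8                          # probability of *retaining* a unit, as in the paper
    x = rng.normal(size=(4, 10))     # toy activations of a layer with 10 units
    W = rng.normal(size=(10, 3))     # outgoing weights of that layer

    # Training: each unit is kept independently with probability p.
    mask = rng.random(x.shape) < p
    train_out = (x * mask) @ W

    # Test: no units are dropped; instead the outgoing weights are scaled by p.
    test_out = x @ (p * W)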
The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation
x = K.in_train_phase(K.dropout(x, level=self.p), x)
seems to indicate that layer outputs are indeed passed through unchanged at test time.
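In plain Python, the branching that line performs looks roughly like the sketch below. This is a conceptual re-implementation for illustration only; apply_dropout_mask is a hypothetical stand-in for whatever K.dropout does internally:

    import numpy as np

    def apply_dropout_mask(x, level, rng):
        # Hypothetical stand-in for K.dropout: zero out roughly a fraction
        # `level` of the entries.
        return x * (rng.random(x.shape) >= level)

    def dropout_layer(x, level, training, rng):
        # Mirrors K.in_train_phase(dropped_x, x): use the dropped tensor while
        # training, pass x through untouched at test time.
        return apply_dropout_mask(x, level, rng) if training else x

    rng = np.random.default_rng(0)
    x = np.ones((2, 5))
    print(dropout_layer(x, level=0.5, training=True, rng=rng))   # some entries zeroed
    print(dropout_layer(x, level=0.5, training=False, rng=rng))  # identical to x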
Further, I cannot find any code that scales down the weights after training is complete, as the paper suggests. My understanding is that this scaling step is fundamentally necessary to make dropout work, since it makes the test-time output match the expected output of the intermediate layers over the ensemble of "subnetworks" sampled during training. Without it, the test-time computation can no longer be related to that ensemble of "subnetworks."
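The expectation argument I have in mind, checked numerically on toy data (a sketch with an assumed retain probability p, not Keras code):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.8                          # probability of retaining a unit
    x = rng.normal(size=10)          # toy activations
    W = rng.normal(size=(10, 3))     # outgoing weights

    # Average the masked pre-activation over many sampled "subnetworks" ...
    samples = [(x * (rng.random(x.size) < p)) @ W for _ in range(100_000)]
    print(np.mean(samples, axis=0))

    # ... and compare with the test-time output using weights scaled by p.
    print(x @ (p * W))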
My question, then, is where is this scaling effect of dropout implemented in Keras, if at all?
Update 1: OK, so Keras uses inverted dropout, although it is simply called dropout in the Keras documentation and code. Neither http://cs231n.github.io/neural-networks-2/#reg nor the answer at https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout seems to claim that the two are equivalent. I can see that they do similar things, but I have yet to see anyone state that they are exactly the same, and I think they are not.
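To make the comparison concrete, here are the two procedures written out as a NumPy sketch (my own toy example; p denotes the retain probability, so the inverted variant divides by p during training):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.8                                  # retain probability
    x = rng.normal(size=10)
    W = rng.normal(size=(10, 3))
    mask = rng.random(x.size) < p

    # "Classic" dropout: plain mask at train time, weights scaled by p at test time.
    classic_train = (x * mask) @ W
    classic_test = x @ (p * W)

    # Inverted dropout: mask and rescale by 1/p at train time, do nothing at test time.
    inverted_train = (x * mask / p) @ W
    inverted_test = x @ W

    # Both the train-time and the test-time outputs differ by the same factor of p;
    # whether that makes the two schemes "exactly the same" is what I am asking.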
So a new question: Are dropout and inverted dropout equivalent? To be clear, I'm looking for mathematical justification for saying they are or aren't.