
Occasionally I see some models using SpatialDropout1D instead of Dropout. For example, in the part-of-speech tagging neural network, they use:

from keras.models import Sequential
from keras.layers import (Activation, Dense, Embedding, GRU, RepeatVector,
                          SpatialDropout1D, TimeDistributed)

# s_vocabsize, EMBED_SIZE, MAX_SEQLEN, HIDDEN_SIZE, t_vocabsize are defined elsewhere
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  # <-- this is the layer in question
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))

The Keras documentation says:

This version performs the same function as Dropout, however it drops entire 1D feature maps instead of individual elements.

However, I am unable to understand the meaning of "entire 1D feature maps". More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora. Can someone explain this concept using the same model as on Quora?

Also, in what situations would we use SpatialDropout1D instead of Dropout?


2 Answers


To make it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at two examples:

  1. Dropout(): Let's define a 2D input whose rows are timesteps and columns are channels: [[1, 1, 1], [2, 2, 2]]. Dropout considers every element independently and may produce something like [[1, 0, 1], [0, 2, 2]].

  2. SpatialDropout1D(): Here the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the 2nd channel (column) was zeroed across all timesteps: the entire 1D feature map is dropped together.
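
To see this on an actual tensor, here is a minimal sketch (it assumes tf.keras / TF 2.x; the input values and the rate of 0.5 are arbitrary):

import numpy as np
import tensorflow as tf

# (batch, timesteps, channels) = (1, 4, 3)
x = np.arange(1, 13, dtype="float32").reshape(1, 4, 3)

# training=True forces the dropout mask to be applied; note that the
# surviving values are scaled by 1/(1 - rate), i.e. doubled here.
print(tf.keras.layers.Dropout(0.5)(x, training=True))           # scattered zeros
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training=True))  # whole channels (columns) zeroed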

– Dilshat

The noise shape

In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total there will be 8 independent coin flips, and any number of values may become zero, from 0 to 8.

Sometimes there is a need to do more than that. For example, one may need to drop the whole slice along the 0 axis. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent coin flips: each pair of elements along the first axis is either kept together or dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8; it cannot be odd (1, 3, 5, or 7).
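
Here is a minimal sketch of that [1, 2, 2] noise shape (assuming tf.keras / TF 2.x; the rate of 0.5 is arbitrary):

import tensorflow as tf

x = tf.reshape(tf.range(1., 9.), (2, 2, 2))  # 8 elements, shape [2, 2, 2]

# noise_shape=(1, 2, 2): only 4 coin flips; the mask is broadcast along axis 0,
# so the two elements sharing a position on axes 1 and 2 live or die together.
layer = tf.keras.layers.Dropout(0.5, noise_shape=(1, 2, 2))
print(layer(x, training=True))  # zeros always come in pairs along axis 0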

Another way to view this is to imagine that the input tensor is in fact of shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping some of the bytes in the middle, the layer drops the full multi-byte value.

Why is it useful?

The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, the drop decision is made independently for each batch and channel component, but all rows and columns within that component share the same fate. In other words, the whole [l, m] feature map is either kept or dropped.

You may want to do this to account for the correlation between adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent pixels from co-adapting with their neighbors across the feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D does: it promotes independence between feature maps.
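
A minimal sketch of SpatialDropout2D (assuming tf.keras / TF 2.x with channels-last data; the shapes and rate are arbitrary):

import tensorflow as tf

x = tf.ones((2, 8, 8, 16))  # (batch, height, width, channels)
y = tf.keras.layers.SpatialDropout2D(0.25)(x, training=True)

# For each (sample, channel) pair, the whole 8x8 feature map is either
# all zeros or uniformly scaled by 1/(1 - 0.25); inspect one map:
print(y[0, :, :, 0])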

SpatialDropout1D is very similar: given shape(x) = [k, l, m], it uses noise_shape = [k, 1, m] and drops entire 1D feature maps.
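
In tf.keras terms this can be written out explicitly (a sketch, with a fixed batch size of 4 so the noise shape can be spelled out; the two layers are equivalent in distribution and only draw different random masks):

import tensorflow as tf

x = tf.ones((4, 10, 8))  # (k, l, m) = (batch, timesteps, channels)

spatial = tf.keras.layers.SpatialDropout1D(0.3)
manual = tf.keras.layers.Dropout(0.3, noise_shape=(4, 1, 8))  # [k, 1, m]

# One coin flip per (sample, channel) pair, reused for every timestep:
# each channel's time series is either all zeros or uniformly scaled.
print(spatial(x, training=True)[0, :, 0])
print(manual(x, training=True)[0, :, 0])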

Reference: Efficient Object Localization Using Convolutional Networks by Jonathan Tompson et al.

– Maxim
  • I was inspecting the source code, after reading the same description at Tensorflow's documentation, and I had the strong impression that this description is exactly the opposite of what the code actually does.... – Daniel Möller May 17 '18 at 16:54
  • @DanielMöller Do you mean that `[k, 1, 1, n]` is not dropping the feature maps? My experiments show that they do. I'd say that keras doc is a bit misleading. – Maxim May 17 '18 at 17:03
  • I'm not sure about what "feature maps" mean there regarding dimensions, but from what I could tell by [this code](https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/ops/nn_ops.py/#L2262), this is masking "samples" and "channels". (But I should experiment this as well before concluding). – Daniel Möller May 17 '18 at 17:55
  • By the code, `[k,1,1,n]` seems to be removing certain samples and certain channels, without touching any spatial dimension. – Daniel Möller May 17 '18 at 17:56
  • Ok... I think my problem is text interpretation here... when I read "kept independently" I thought it meant "not discarded no matter what", but actually it probably means "may be discarded independently from other axes". – Daniel Möller May 17 '18 at 18:10
  • @DanielMöller yes, my understanding is the same, and I illustrate it by an example in my answer. – Maxim May 17 '18 at 19:11
  • @Maxim When you explain the SpatialDropout1D, do you assume the input tensor is still `(2,2,2)`, and the noise_shape is `(1,2,2)`? If so, I think the number of zero elements can be `0`, `2`, `4`, `6` or `8`. And it cannot be `1`, `3`, `5`, `7`. Am I correct? – Raven Cheuk May 18 '18 at 02:39
  • @RavenCheuk oh, thanks for spotting it. Yes, it's 0, 2, 4, 6, 8. I'll correct the typo. – Maxim May 18 '18 at 13:40
  • @Maxim I am reading this answer pretty late. It looks like we can use a custom noise shape with the normal `Dropout` layer as well. So say, if my input tensor is of shape `10, 8, 5`, and I use a normal `Dropout` layer with noise_shape `10, 8, 1`, should that give me the same output as using a `SpatialDropout1D`? – Asif Iqbal Dec 26 '18 at 00:37
  • What do you mean by 8 independent coin flips, as in the example where the data is [2, 2, 2]? Here you have 3 data points, and each coin flip corresponds only to 1 element in this array. So IMO there would be 3 coin flips here, or 3 independent events. Am I misunderstanding something? – StayFoolish May 22 '19 at 03:43
  • @StayFoolish He's referring to the dimensions of the tensor. [2, 2, 2] are not the actual data points. – Binoy Dalal Jul 28 '19 at 03:33