
I'm trying to understand dilated convolution. I'm already familiar with the idea of enlarging the effective kernel by filling the gaps between its taps with zeros; it's useful to cover a bigger area and capture larger objects. But can someone please explain how dilated convolutional layers can keep the original resolution of the feature map? They are used in the DeepLabV3+ architecture with atrous rates from 2 to 16. How is it possible to use a dilated convolution with an obviously bigger effective kernel, without zero padding, and still get a consistent output size?

deeplabV3+ structure:

[figure: DeepLabV3+ architecture]

I'm confused because when I look at this explanation:

[figure: 3x3 convolution with dilation factor 2 on a 7x7 input, producing a 3x3 output]

The output size (3x3) of the dilated convolution layer is smaller?

Thank you so much for your help!

Lukas


3 Answers


Maybe there is a small confusion between strided convolution and dilated convolution here. A strided convolution is the usual sliding-window convolution, except that instead of moving by a single pixel from one output position to the next, it jumps by the stride. A dilated convolution "looks" at a bigger window: instead of taking neighboring pixels, it samples them with "holes" in between. The dilation factor defines the size of those holes.
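
To make the difference concrete, here is a small sketch of my own (not part of this answer) showing which input indices each operation reads for a kernel of length 3:

```python
# Hypothetical illustration: input indices read by a length-3 kernel.
def strided_taps(out_idx, stride):
    # Contiguous 3-pixel window; the window start jumps by `stride`
    # for each output position.
    start = out_idx * stride
    return [start, start + 1, start + 2]

def dilated_taps(out_idx, dilation):
    # The window start moves by 1 per output position, but the taps
    # are spread apart by `dilation` ("holes" in between).
    return [out_idx, out_idx + dilation, out_idx + 2 * dilation]

print(strided_taps(1, stride=2))    # [2, 3, 4]  contiguous, shifted by 2
print(dilated_taps(1, dilation=2))  # [1, 3, 5]  spread out, shifted by 1
```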

rkellerm
  • Thanks for your answer. I'm familiar with strided convolutional layers. Let's take this example: a 7x7 input and a dilated convolutional layer with dilation factor = 2; the result is an output size of 3x3. With a standard convolutional layer (dilation factor = 1), a 3x3 kernel and stride = 1, the output size would be 5x5 pixels. How is it possible in the DeepLabV3+ architecture to keep a consistent output resolution (output stride 16) with different dilation factors (from 2 to 16)? – Lukas Mar 07 '19 at 17:17
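
(Side note, not from the thread: the shapes in this comment can be verified with PyTorch, used here purely as an illustration.)

```python
# Hypothetical check of the comment's shapes.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)  # batch, channels, height, width

# 3x3 kernel, dilation 2, no padding: effective kernel 5x5 -> 7 - 5 + 1 = 3
print(nn.Conv2d(1, 1, 3, dilation=2)(x).shape)  # torch.Size([1, 1, 3, 3])

# 3x3 kernel, dilation 1, no padding: 7 - 3 + 1 = 5
print(nn.Conv2d(1, 1, 3)(x).shape)              # torch.Size([1, 1, 5, 5])
```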

Well, without padding the output would become smaller than the input. The effect is comparable to the reduction effect of a normal convolution.

Imagine you have a 1d-tensor with 1000 elements and a dilated 1x3 convolution kernel with a dilation factor of 3. This corresponds to a "total kernel length" of 1 + 2 (holes) + 1 + 2 (holes) + 1 = 7. With a stride of 1, the output would be a 1d-tensor with 1000 - 7 + 1 = 994 elements. In the case of a normal convolution with a 1x3 kernel and a stride of 1, the output would have 1000 - 3 + 1 = 998 elements. As you can see, the effect can be calculated just like for a normal convolution, using the enlarged kernel length :)
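
For reference, a minimal sketch of my own (assuming PyTorch's convention that dilation = 3 spaces the taps 3 apart, i.e. 2 holes between them):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 1000)  # batch, channels, length

# Dilated 1x3 kernel, dilation 3: total kernel length 7 -> 1000 - 7 + 1
print(nn.Conv1d(1, 1, 3, dilation=3)(x).shape)  # torch.Size([1, 1, 994])

# Normal 1x3 kernel: 1000 - 3 + 1
print(nn.Conv1d(1, 1, 3)(x).shape)              # torch.Size([1, 1, 998])
```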

In both situations the output becomes smaller without padding. But, as you can see, the dilation factor does not divide the output size the way the stride factor does; it only trims the borders, like a larger kernel would.

Why do you think no padding is done within the DeepLab framework? I think the official TensorFlow implementation uses padding.
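
Indeed, with "same"-style padding the output resolution stays constant for every atrous rate. Here is a minimal sketch of my own (not the official DeepLab code), using PyTorch, where padding = rate does the job for a 3x3 kernel with stride 1:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 65, 65)

# For a 3x3 kernel with stride 1, setting padding equal to the atrous
# rate keeps the output at the input resolution for any rate.
for rate in (2, 4, 8, 16):
    conv = nn.Conv2d(1, 1, kernel_size=3, stride=1,
                     dilation=rate, padding=rate)
    print(rate, conv(x).shape)  # always torch.Size([1, 1, 65, 65])
```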

Best, Frank

FranklynJey

My understanding is that the authors are saying that one does not need to downsample the image (or any intermediate feature map) before applying, say, a 3x3 convolution, which is the typical pattern in DCNNs (e.g., VGG16 or ResNet) used for feature extraction, followed by upsampling for semantic segmentation. In a typical encoder-decoder network (e.g., UNet or SegNet), one first downsamples the feature map by half, then applies a convolution, and then upsamples the feature map by 2x again.

All of these effects (downsampling, feature extraction and upsampling) can be captured in a single atrous convolution (of course with stride = 1). Moreover, the output of an atrous convolution is a dense feature map, whereas the same "downsampling, feature extraction and upsampling" pipeline results in a sparse feature map. See the following figure for more details; it is from the DeepLabV1 paper. Therefore, you can control the size of a feature map by replacing any normal convolution with an atrous convolution in an intermediate layer.

That's also why there is a constant output_stride (input resolution / feature map resolution) of 16 across all the atrous convolutions in the picture (cascaded model) you posted above.

[figure: dense feature extraction with atrous convolution vs. sparse feature extraction with downsampling and upsampling, from the DeepLabV1 paper]
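
As a rough sketch of my own (a toy layer, not the DeepLab code): a strided convolution halves the resolution and doubles the output_stride, while the atrous replacement keeps both fixed:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)

# Typical encoder step: stride 2 halves the feature map (output_stride x2).
down = nn.Conv2d(1, 1, 3, stride=2, padding=1)
print(down(x).shape)    # torch.Size([1, 1, 32, 32])

# Atrous replacement: stride 1, dilation 2, padding 2 covers the same 5x5
# context but keeps the resolution, so output_stride stays fixed (e.g. 16).
atrous = nn.Conv2d(1, 1, 3, stride=1, dilation=2, padding=2)
print(atrous(x).shape)  # torch.Size([1, 1, 64, 64])
```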

Sanchit