
Let's say the input to an intermediate CNN layer is of size 512×512×128, and that in the convolutional layer we apply 48 7×7 filters at stride 2 with no padding. What is the size of the resulting activation map?

I checked some previous posts (e.g., here or here), which point to this Stanford course page. The formula given there is (W − F + 2P)/S + 1 = (512 − 7)/2 + 1, which would imply that this setup is not possible, since the value we get is not an integer.

However, if I run the following snippet in Python 2.7, the output suggests that the size of the activation map was computed as (512 − 6)/2 = 253, which makes sense but does not match the formula above:

>>> import torch
>>> conv = torch.nn.Conv2d(in_channels=128, out_channels=48, kernel_size=7, stride=2, padding=0)
>>> conv
Conv2d(128, 48, kernel_size=(7, 7), stride=(2, 2))
>>> img = torch.rand((1, 128, 512, 512))
>>> out = conv(img)
>>> out.shape
torch.Size([1, 48, 253, 253])

Any help in understanding this conundrum is appreciated.
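For reference, here is the arithmetic behind the two results side by side, in plain Python (no PyTorch needed): the strict Stanford formula yields a non-integer, while flooring the division before adding 1 reproduces the 253 seen above.

```python
import math

# Dimensions from the question: input width W, filter size F, padding P, stride S
W, F, P, S = 512, 7, 0, 2

# Strict formula from the Stanford notes: (W - F + 2P)/S + 1
strict = (W - F + 2 * P) / S + 1           # 253.5, not an integer

# Same formula with a floor on the division, which matches the observed output
floored = math.floor((W - F + 2 * P) / S) + 1   # 253

print(strict, floored)
```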

A B

1 Answer


Here is the formula used in PyTorch: Conv2d (go to the "Shape" section).
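That shape formula, Hout = floor((Hin + 2·padding − dilation·(kernel_size − 1) − 1)/stride + 1), can be sketched as a small helper; applied to the question's numbers it gives the 253 that `out.shape` reports.

```python
import math

def conv2d_out(h, k, s=1, p=0, d=1):
    # Output-size formula from the PyTorch Conv2d docs:
    # floor((h + 2p - d*(k - 1) - 1) / s + 1)
    return math.floor((h + 2 * p - d * (k - 1) - 1) / s + 1)

print(conv2d_out(512, 7, s=2))  # 253, matching the question's out.shape
```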

Also, as far as I know, this is the best tutorial on this subject.

Bonus: here is a neat visualizer for conv calculations.

Separius
  • The visualization is great. I set the input to 6×6 and the kernel size to 3×3 with padding 0 and stride 2, and realized there is an 'early termination' condition for sliding the convolution (i.e., if the stride is too large compared to how much of the image is left at the end). So, the formula is not entirely correct, as it does not take this into account. – A B Dec 16 '19 at 09:02
  • @AB I think the visualization is correct and it's consistent with the formula in pytorch. Hin=6, K=3, P=0, D=1, S=2; gives Hout=2 in both formula and the visualization. – Separius Dec 16 '19 at 09:14
  • I was referring to the formula I got from other sources (W − F + 2P)/S + 1, I agree that pytorch formula makes sense. – A B Dec 16 '19 at 09:20
  • @AB oh I see. I think your formula is also right, it only needs a `floor()` and it will be the same as pytorch with dilation=1 (default mode) – Separius Dec 16 '19 at 09:24
  • 1
    To quote the Stanford course page mentioned in my post: "For example, when the input has size W=10, no zero-padding is used P=0, and the filter size is F=3, then it would be impossible to use stride S=2, since (W−F+2P)/S+1=(10−3+0)/2+1=4.5, i.e. not an integer.... Therefore, this setting of the hyperparameters is considered to be invalid and a ConvNet library could throw an exception or zero pad the rest to make it fit, or crop the input to make it fit, or something." So, my guess it is an implementation point and pytorch just gracefully handles it – A B Dec 16 '19 at 09:32
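To tie the comment thread together: for the 6×6 input with a 3×3 kernel, stride 2, and no padding discussed above, the floored Stanford formula and the PyTorch formula (with dilation=1) agree, both giving 2. A minimal check:

```python
import math

# Early-termination example from the comments: W=6, F=3, P=0, S=2
W, F, P, S = 6, 3, 0, 2

# Stanford formula with a floor on the division
stanford_floored = math.floor((W - F + 2 * P) / S) + 1      # 2

# PyTorch Conv2d shape formula with dilation = 1
pytorch = math.floor((W + 2 * P - 1 * (F - 1) - 1) / S + 1)  # 2

print(stanford_floored, pytorch)
```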