Size of Input and ConvNet

Question

In the CS231n course on Convolutional Neural Networks, the ConvNet notes say:

INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.

CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.

From the document, I understand that an INPUT will contain images of size 32 (width) x 32 (height) x 3 (depth). But later, the result of the CONV layer is [32x32x12] if we decide to use 12 filters. Where did the 3, the depth of the image, go?

Please help me out here, thank you in advance.

score 1 · Answer 1 · answered Mar 08 '18 at 06:43

The depth of 3 gets "distributed" into each feature map (the result of convolving the input with one filter).

Before thinking about 12 filters, just think of one. That is, you are applying a convolution with a filter of size [filter_width * filter_height * input_channel_number]. Because the filter has the same number of channels as the input, you are basically applying input_channel_number separate 2D convolutions, one per input channel, and then summing the results. The output is a single 2D feature map.

Now you can repeat this 12 times, once per filter, to get 12 feature maps, and stack them together to get your [32 x 32 x 12] feature volume. That's why the full set of filter weights is a 4D tensor of size [filter_width * filter_height * input_channel_number * output_channel_number]; in your case this would be something like [3x3x3x12] if the filters are 3x3 (please note that the ordering of the dimensions may vary between frameworks, but the operation is the same).
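To make the shapes concrete, here is a minimal sketch in plain Python (no framework). The function name `conv2d_same`, the nested-list layout, and the filter ordering [output_channel][height][width][input_channel] are my own choices for illustration, not any library's API. It assumes stride 1, odd filter sizes, and zero padding so the output keeps the input's width and height.

```python
import random


def conv2d_same(image, filters):
    """Stride-1 convolution with zero padding (illustrative helper).

    image:   nested lists indexed [y][x][input_channel]
    filters: nested lists indexed [output_channel][dy][dx][input_channel]
    Returns nested lists indexed [y][x][output_channel].
    """
    height, width, c_in = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(filters[0]), len(filters[0][0])
    pad_h, pad_w = kh // 2, kw // 2  # assumes odd filter sizes
    output = [[[0.0] * len(filters) for _ in range(width)] for _ in range(height)]
    for k, filt in enumerate(filters):  # one 2D feature map per filter
        for y in range(height):
            for x in range(width):
                total = 0.0
                for dy in range(kh):
                    for dx in range(kw):
                        iy, ix = y + dy - pad_h, x + dx - pad_w
                        if 0 <= iy < height and 0 <= ix < width:
                            # Sum over ALL input channels: this is where
                            # the depth of 3 "disappears".
                            for c in range(c_in):
                                total += image[iy][ix][c] * filt[dy][dx][c]
                output[y][x][k] = total
    return output


def output_shape(volume):
    return (len(volume), len(volume[0]), len(volume[0][0]))


random.seed(0)
image = [[[random.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
filters = [[[[random.random() for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(12)]
features = conv2d_same(image, filters)
print(output_shape(features))  # (32, 32, 12)
```

Each of the 12 filters sees all 3 input channels and produces one number per position, so the output depth is set by the number of filters, not by the input depth.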

Qianyi Zhang


So the `32` in the input is different from the `32` in the Conv layer result, right? I don't know if I have this right: at every position the filter slides to, the products (x*weight) over the receptive field add up to a single number. The size stays 32 because the stride is 1. – Huyen Mar 08 '18 at 07:11
I am not sure how you define "different" in this case, or "same" for that matter. The 32 in the Conv layer result stays 32 if you use "same" padding for your convolution (padding is a parameter, and its default depends on the framework). If you use "valid", which means no padding, your output side would be 32 - filter_size + 1 (which is 30 if you convolve with a 3x3 filter). – Qianyi Zhang Mar 12 '18 at 02:08
https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t – Qianyi Zhang Mar 12 '18 at 02:08
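
The arithmetic in the comments above can be sketched as a small helper. The function name `conv_output_size` is hypothetical, and the formulas follow the TensorFlow-style "same"/"valid" convention discussed in the linked question; other frameworks express padding as an explicit number of pixels instead.

```python
def conv_output_size(input_size, filter_size, padding="same", stride=1):
    """Output width (or height) of a convolution along one dimension."""
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return -(-input_size // stride)  # ceil(input_size / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the input.
        return (input_size - filter_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


print(conv_output_size(32, 3, padding="same"))   # 32
print(conv_output_size(32, 3, padding="valid"))  # 30
```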

score 0 · Accepted Answer · answered Apr 18 '18 at 10:30

So, this is fun. I read the document again and found the answer just a scroll down from the passage I quoted. Before, I thought a filter was purely 2D, for example 32 x 32 with no depth. The truth is:

A typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.
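
As a rough illustration of the quoted passage, here is a sketch of the dot product for one 5x5x3 filter. The helper `dot_at` and the constant pixel and weight values are made up for the example, and it assumes stride 1 with no padding.

```python
def dot_at(image, filt, top, left):
    """Dot product between a filter and the image patch at (top, left).

    image: nested lists indexed [y][x][channel]
    filt:  nested lists indexed [dy][dx][channel]
    """
    total = 0.0
    for dy, row in enumerate(filt):
        for dx, weights in enumerate(row):
            for c, w in enumerate(weights):
                total += image[top + dy][left + dx][c] * w
    return total


# Made-up data: every pixel is (R, G, B) = (1, 2, 3), every weight is 0.5.
image = [[[1.0, 2.0, 3.0] for _ in range(32)] for _ in range(32)]
filt = [[[0.5, 0.5, 0.5] for _ in range(5)] for _ in range(5)]

# One filter position covers 5 * 5 * 3 = 75 inputs and yields ONE number.
print(dot_at(image, filt, 0, 0))  # 75.0

# Sliding over every position where the filter fits gives a 2D map.
side = 32 - 5 + 1
activation_map = [[dot_at(image, filt, y, x) for x in range(side)]
                  for y in range(side)]
print(len(activation_map), len(activation_map[0]))  # 28 28
```

The filter spans the full depth of the input, so the 3 color channels are consumed inside each dot product rather than appearing in the output shape.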
