
I am learning about convolutional neural networks and trying to figure out how the mathematical computation takes place. Suppose there is an input image with 3 channels (RGB), so the shape of the image is 28*28*3. Suppose 6 filters of size 5*5*3 and stride 1 are applied for the next layer, so we get 24*24*6 in the next layer. Since the input image is an RGB image, how is each filter's 24*24 output related to the RGB channels, i.e., does each filter internally construct an image of size 24*24*3?
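Here is a minimal shape check (assuming TensorFlow 2's tf.nn.conv2d with the NHWC layout and "VALID", i.e. no, padding) that confirms the 24*24*6 output, since (28 - 5)/1 + 1 = 24:

    import tensorflow as tf

    # Batch of 1 dummy RGB image, laid out as NHWC: [batch, height, width, channels].
    x = tf.random.normal([1, 28, 28, 3])

    # 6 filters of size 5x5, each spanning all 3 input channels: [5, 5, 3, 6].
    w = tf.random.normal([5, 5, 3, 6])

    # Stride 1, no padding ("VALID"): spatial size shrinks to (28 - 5)/1 + 1 = 24.
    y = tf.nn.conv2d(x, w, strides=1, padding="VALID")
    print(y.shape)  # (1, 24, 24, 6)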

Neel Shah

1 Answer


After you've applied the first convolutional layer, you can't think of it as being RGB anymore. That [5, 5, 3] convolution takes all of the information from 5*5*3 = 75 floats (25 pixels, each with 3 channels) and mixes it together based on whatever parameters the network has learned for that filter.
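To make that concrete, here is a minimal NumPy sketch of what one filter does at one spatial position (shapes taken from the question; random values stand in for real pixels and learned weights):

    import numpy as np

    patch = np.random.rand(5, 5, 3)  # one 5x5 window of the RGB input
    filt = np.random.rand(5, 5, 3)   # one learned filter, same shape as the patch

    # One output scalar: all 75 input floats are mixed into a single number.
    out = np.sum(patch * filt)

Sliding that window across the 28x28 image with stride 1 produces a 24x24 grid of such scalars for this filter; stacking the outputs of all 6 filters gives the 24*24*6 result.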

In many image recognition tasks, the first layer often learns things like edge detectors, sharpening masks, and so on. For example, see this visualization of the layers of VGG16.

But the output itself is just... information, at that point. Or, to be more precise, the meaning of the depth channels is going to depend on how the network has learned. There will probably be meaningful things that differentiate the depth channels (and what the different values in there mean), but it's unlikely to be intuitive without trying to visualize it. I don't know of a project that's visualized the depth channels independently, but someone might have.

dga
  • Okay, that's quite clear. So at the first convolution layer each filter is applied to all 3 channels (R, G, and B) and then the output is summed elementwise. Is that correct? – Neel Shah Jul 05 '16 at 06:17
  • It depends what you mean by applied. :) Each output element is the sum of all input elements in a "patch" (the same size as the filter) multiplied by the corresponding filter parameters. A filter of depth 'k' has k entirely different sets of filter parameters (it's 4D) that get multiplied by a 3D input patch. Each of those 'k' filters produces one output; a naive version of this is sketched after these comments. This blog post may also present things in a way that's helpful: http://colah.github.io/posts/2014-07-Understanding-Convolutions/ The definition in the TF docs: https://www.tensorflow.org/versions/r0.9/api_docs/python/nn.html#conv2d – dga Jul 05 '16 at 18:01
  • (The answer about interpreting strides also may be helpful: http://stackoverflow.com/questions/34642595/tensorflow-strides-argument ) – dga Jul 05 '16 at 18:03
  • Thanks! I have an idea of how the implementation works now. Any source for learning how to build a CNN in TensorFlow? The one on their website seems a bit complex to understand – Neel Shah Jul 05 '16 at 19:13
  • The convolutional MNIST example is probably the best start I know of, but past that, I think you might be best served asking a new question for that specific part. – dga Jul 05 '16 at 19:51
  • @dga can you take a look at this question: https://stackoverflow.com/q/58926940/5904928 Why is there a huge difference between the LSTM final output and the state output? – Aaditya Ura Nov 19 '19 at 05:11
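To illustrate the comment above about the 4D filter, here is a naive stride-1 sketch in NumPy (my own illustration, not TensorFlow's actual implementation; note that, like most deep-learning libraries, it computes cross-correlation rather than a flipped convolution):

    import numpy as np

    def conv2d_valid(image, filters):
        # image: [H, W, C_in]; filters: [fh, fw, C_in, C_out].
        # Returns [H - fh + 1, W - fw + 1, C_out].
        H, W, C_in = image.shape
        fh, fw, _, C_out = filters.shape
        out = np.zeros((H - fh + 1, W - fw + 1, C_out))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + fh, j:j + fw, :]  # [fh, fw, C_in]
                for k in range(C_out):
                    # All input channels are multiplied by the k-th filter
                    # and summed into one scalar per output channel.
                    out[i, j, k] = np.sum(patch * filters[..., k])
        return out

    image = np.random.rand(28, 28, 3)
    filters = np.random.rand(5, 5, 3, 6)
    print(conv2d_valid(image, filters).shape)  # (24, 24, 6)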