The problem is that the two spatial dimensions of an image have locality: things that are nearby are expected to be related in some fundamental way. For example, a pixel next to a hair pixel is, a priori, expected to be a hair pixel too. The channels have no such relationship. When you only have 3 channels, a 3D convolution is equivalent to being fully connected in z. But when you have 27 channels (e.g. in the middle of the net), why would any 3 particular channels be considered "close" to each other?
This answer explains the difference nicely.
Doing a "fully-connected" relationship over the channels is what most libraries do by default. Note this line in particular: "...a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]". For an input vector of size in_channels, a matrix of size [in_channels, out_channels] is fully-connected. So, the filter can be thought of as a fully-connected layer on a "patch" of image size [filter_height, filter_width].
To illustrate, on a single channel, a regular plain old image filter takes a patch of image and maps that patch to a single pixel in a new image. Like so:

[image: a filter mapping a patch of a single-channel image to one pixel of the output image]
On the other hand, suppose that we have multiple channels. Instead of performing a linear mapping from a 3x3 patch to a 1x1 pixel, we perform a linear mapping from a 3x3xin_channels patch to a 1x1xout_channels set of pixels. How do we do this? Well, a linear mapping is just a matrix. A 3x3xin_channels patch can be written as a vector with 3*3*in_channels entries, and a 1x1xout_channels set of pixels can be written as a vector with out_channels entries. A linear mapping between the two is given by a matrix with 3*3*in_channels rows and out_channels columns. The entries of that matrix are the parameters of that layer of the network. The layer works by simply multiplying the input vector by the weight matrix to get the output vector, and this is repeated over every patch of the image. (In practice, instead of looping over the patches, libraries achieve the same result with some fanciness, e.g. an im2col-style reshape followed by one big matrix multiplication; a code sketch of the naive loop follows the illustrations below.)
To illustrate, the mapping takes this 3x3xin_channels column:

[image: a 3x3xin_channels column of the input volume]
To this 1x1xout_channels stack of pixels:

[image: a 1x1xout_channels stack of output pixels]
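Here is a rough NumPy sketch of that loop-over-patches version (the sizes are made up; real libraries replace the Python loops as noted above, but the output is the same):

```python
import numpy as np

def conv2d_as_matmul(image, kernel):
    """'Valid' 2D convolution (really cross-correlation, as in most
    deep-learning libraries), done as one matrix multiply per patch."""
    H, W_img, in_c = image.shape
    fh, fw, _, out_c = kernel.shape
    W_mat = kernel.reshape(fh * fw * in_c, out_c)
    out = np.zeros((H - fh + 1, W_img - fw + 1, out_c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw, :]      # 3x3xin_channels patch
            out[i, j, :] = patch.reshape(-1) @ W_mat  # 1x1xout_channels pixels
    return out

image = np.random.randn(32, 32, 8)
kernel = np.random.randn(3, 3, 8, 16)
print(conv2d_as_matmul(image, kernel).shape)  # (30, 30, 16)
```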
Now, what you are proposing is to do something with the following bit:

[image: a 3x3x3 sub-block of the 3x3xin_channels column, spanning only 3 of the channels]
There is no mathematical reason why you can't do something with that 3x3x3 patch containing only 3 of your whole set of in_channels. However, whichever 3 channels you choose, the choice is totally arbitrary, and those channels have no intrinsic relationship to one another that would suggest that treating them as "nearby" would help.
To reiterate: in an image, pixels that are near each other are expected to be "similar" or "related" in some sense, and that is why a convolution works at all. If you jumbled up the pixels and then did a convolution, it would be worthless. Along the channel dimension, everything already is that kind of jumble. The "red" channel isn't near the "green" channel or the "blue" channel, because "nearness" doesn't make any sense between channels. Since "nearness" isn't a property of the channel dimension, doing a convolution along that dimension probably isn't going to be useful.
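One way to see this concretely: shuffle the channels, shuffle the filter weights the same way, and nothing changes, because the layer never relied on any ordering of the channels to begin with. A small self-contained sketch, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 8))       # one 3x3xin_channels patch
kernel = rng.standard_normal((3, 3, 8, 16))  # in_channels=8, out_channels=16

def apply_filter(patch, kernel):
    # Same flatten-and-multiply as above, at a single location.
    fh, fw, in_c, out_c = kernel.shape
    return patch.reshape(-1) @ kernel.reshape(fh * fw * in_c, out_c)

# Permute the channel order of BOTH the patch and the kernel's in_channels
# axis the same way: the output is identical.
perm = rng.permutation(8)
out_a = apply_filter(patch, kernel)
out_b = apply_filter(patch[:, :, perm], kernel[:, :, perm, :])
print(np.allclose(out_a, out_b))  # True
```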
On the other hand, we can simply take ALL of the in_channels as input and generate ALL of the out_channels simultaneously, letting them influence each other in a linear sort of way. Note that the linear transformation described above involves a sort of cross-pollination of the parameters. For example, for a layer at the top of the network, taking in a 3x3 patch of r, g, b channels with entries labeled r_1_1 through r_3_3 and so on, a single pixel in a single channel of the output from that patch would look like:

A*r_1_1 + B*r_1_2 + ... + C*r_3_3 + D*b_1_1 + E*b_1_2 + ... + F*b_3_3 + G*g_1_1 + ...

where the capital letters are entries of the weight matrix.
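In code, that single output value is just one weight per entry of the whole 3x3x3 patch, multiplied and summed. A tiny sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.standard_normal((3, 3, 3))    # a 3x3 patch of r, g, b channels
weights = rng.standard_normal((3, 3, 3))  # A, B, C, ... for ONE output channel

# Every entry of every channel gets its own weight, and they all
# contribute to the single output value:
out_pixel = np.sum(weights * patch)

# ...which is the same dot product as the fully-connected view above.
assert np.isclose(out_pixel, weights.reshape(-1) @ patch.reshape(-1))
```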
So your question, "Would a strictly 2D CNN perform poorly?", is based on the assumption that the convolutional layer doesn't include any "mixing" between the various channels. That is not the case: the in_channels are ALL combined in a linear mapping to obtain the out_channels.