
I am a little confused about the difference between the conv2d and conv3d functions. For example, suppose I have a stack of N images, each with height H, width W, and 3 RGB channels. The input to the network can take two forms:

form1: (batch_size, N, H, W, 3), a rank-5 tensor
form2: (batch_size, H, W, 3N), a rank-4 tensor

Suppose I apply conv3d with M filters of size (N, 3, 3) to form1, and conv2d with M filters of size (3, 3) to form2.

Do the two perform basically the same feature operations? I think both forms convolve over the temporal and spatial dimensions.

I would really appreciate it if anyone could help me figure this out.

Jiaju Orange Yue

1 Answer


If you have a stack of images, you have a video. You cannot have both input forms: you have either images or videos. For the video case you can use 3D convolution; 2D convolution is not defined for it. If you stack the frames along the channel axis as you describe (3N channels), a 2D convolution will interpret the stack as a single image with many channels, not as a sequence of frames.

Note that a 2D convolution on (batch, H, W, Channels) is the same as a 3D convolution on (batch, H, W, Channels, 1) whose filter spans the entire Channels axis.
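To make this concrete for the shapes in the question, here is a minimal NumPy sketch (not TensorFlow's actual implementation). It assumes no padding ("valid" convolution), stride 1, and cross-correlation without kernel flipping, as deep learning frameworks usually do. The helper names `conv2d_valid` and `conv3d_valid` are made up for illustration. It suggests that a 3D filter covering all N frames computes the same numbers as a 2D filter over the 3N stacked channels, once the weights are regrouped accordingly.

```python
import numpy as np

def conv2d_valid(x, w):
    """x: (H, W, C_in), w: (kh, kw, C_in, M) -> (H-kh+1, W-kw+1, M)."""
    H, W, _ = x.shape
    kh, kw, _, M = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1, M))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            # Sum over kh, kw and C_in; one value per filter.
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

def conv3d_valid(x, w):
    """x: (N, H, W, C), w: (kd, kh, kw, C, M) -> (N-kd+1, H-kh+1, W-kw+1, M)."""
    N, H, W, _ = x.shape
    kd, kh, kw, _, M = w.shape
    out = np.zeros((N - kd + 1, H - kh + 1, W - kw + 1, M))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[t:t + kd, i:i + kh, j:j + kw, :]
                # Sum over kd, kh, kw and C; one value per filter.
                out[t, i, j] = np.tensordot(patch, w, axes=4)
    return out

rng = np.random.default_rng(0)
N, H, W, C, M = 4, 6, 6, 3, 5                  # small sizes chosen for illustration
video = rng.standard_normal((N, H, W, C))      # form1, a single sample
w3d = rng.standard_normal((N, 3, 3, C, M))     # M filters of size (N, 3, 3)

# form2: move the frame axis next to the channels and merge -> (H, W, N*C)
stacked = video.transpose(1, 2, 0, 3).reshape(H, W, N * C)
# Same weights, regrouped to match: (N,3,3,C,M) -> (3,3,N,C,M) -> (3,3,N*C,M)
w2d = w3d.transpose(1, 2, 0, 3, 4).reshape(3, 3, N * C, M)

out3d = conv3d_valid(video, w3d)    # temporal output length is N - N + 1 = 1
out2d = conv2d_valid(stacked, w2d)
same = np.allclose(out3d[0], out2d)
```

The equivalence relies on the 3D filter spanning the whole frame axis with no padding along it. With a shorter temporal kernel, or with padding along the frame axis, the 3D convolution produces several temporal output positions and the two are no longer the same operation.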

Lau
  • I understand what you said. In form2, it is true that I stack the video frames into one BIG image with 3N channels. But this 3N-channel image carries the same temporal information as the image sequence. When you apply conv2d, it still convolves the temporal information together. So I think they have basically the same effect in this case. This is the point I am confused about. Do you think it is correct? – Jiaju Orange Yue Nov 06 '18 at 17:23
  • Yes, but with 3D convolution you can move your filter over different images instead of covering all of them at once. E.g. you can analyze the first and second, then the second and third, and then the third and fourth images in your video when you choose a [H, W, 2] filter for your 3D convolution. This is the idea behind 3D convolution. If you set the third filter dimension of the 3D convolution equal to the number of channels, it is the same as a 2D convolution. Does this answer your question? – Lau Nov 06 '18 at 19:36
  • *If you have a stack of images, you have a video.* — or multiple channels? – gerrit Jun 26 '19 at 16:02
  • @gerrit It depends on how you stack them. If you create a new axis (H, W, C, N), you have a video, so you should stack them along the N axis. If you stack them along the C dimension, you just have more channels. – Lau Jun 28 '19 at 04:25
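The two points from the comments, stacking along a new axis versus the channel axis and sliding a short temporal filter over frame pairs, can be sketched in NumPy as follows. This is only an illustration under assumed sizes. It puts the frame axis first, (N, H, W, C), to match form1 in the question rather than the (H, W, C, N) order used in the comment, and the frame-difference weights `w_t` are a hypothetical example of a depth-2 temporal filter.

```python
import numpy as np

N, H, W, C = 4, 8, 8, 3
# Frame t is filled with the value t so the effect of the filter is easy to see.
frames = [np.full((H, W, C), float(t)) for t in range(N)]

# New axis: a "video" with an explicit frame dimension.
video = np.stack(frames, axis=0)            # (N, H, W, C)
# Channel axis: a single image with N*C channels and no frame dimension.
wide = np.concatenate(frames, axis=-1)      # (H, W, N*C)

# A temporal kernel of depth 2 visits frame pairs (0,1), (1,2), (2,3).
kd = 2
windows = [(t, t + kd - 1) for t in range(N - kd + 1)]

# Hypothetical depth-2 temporal filter: difference between consecutive frames.
w_t = np.array([-1.0, 1.0])
diff = np.stack([
    np.tensordot(w_t, video[t:t + kd], axes=1)   # weighted sum over the 2 frames
    for t in range(N - kd + 1)
])                                           # (N - 1, H, W, C)
```

With the frame axis kept separate, the same small filter is reused at every temporal position and the output still has a temporal dimension (here of length N - 1). With the channel-stacked layout, a 2D convolution mixes all frames in a single step and the output has no temporal axis left.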