I'm trying to understand the transformation performed by tf.layers.conv2d.
The MNIST tutorial code from the TensorFlow website includes this convolution layer:
# Computes 64 features using a 5x5 filter.
# Padding is added to preserve width and height.
# Input Tensor Shape: [batch_size, 14, 14, 32]
# Output Tensor Shape: [batch_size, 14, 14, 64]
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
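
To see the shapes concretely, I built the layer against a dummy input (a minimal sketch, assuming the TF 1.x API; the placeholder is a hypothetical stand-in for pool1):

import tensorflow as tf

# Stand-in for pool1: a batch of 14x14 feature maps with 32 channels
pool1 = tf.placeholder(tf.float32, shape=[None, 14, 14, 32])

conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

print(conv2.shape)  # (?, 14, 14, 64), matching the comment above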
However, my expectation was that the number of input channels would be multiplied by the number of filters, since each filter is applied to each of the 32 input feature maps, giving an output tensor of [batch_size, 14, 14, 2048]. Clearly this is wrong, but I don't know why. How does the transformation actually work? The API documentation tells me nothing about how the filters are applied across the input channels. What would the output be if the input tensor were [batch_size, 14, 14, 48]?
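
For that last question, here is the same sketch with a hypothetical 48-channel input (again assuming TF 1.x, continuing from the snippet above):

# Same layer configuration, but 48 input channels instead of 32
pool_48 = tf.placeholder(tf.float32, shape=[None, 14, 14, 48])

conv_48 = tf.layers.conv2d(
    inputs=pool_48,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)

print(conv_48.shape)  # also (?, 14, 14, 64) -- why 64, and not 48 * 64?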