
I'm following Udacity Deep Learning video by Vincent Vanhoucke and trying to understand the (practical or intuitive or obvious) effect of max pooling.

Let's say my current model (without pooling) uses convolutions with stride 2 to reduce the dimensionality.

  def model(data):
    # Stride-2 convolutions do the downsampling themselves: each conv halves
    # the spatial dimensions, so no pooling layers are needed.
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    # Flatten the feature maps and finish with two fully-connected layers.
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

Now I introduce pooling: replace the strides with a max pooling operation (`nn.max_pool()`) of stride 2 and kernel size 2.

  def model(data):
    # Convolutions now use stride 1; the downsampling is done instead by the
    # 2x2 max-pooling layers (kernel 2, stride 2) that follow each conv.
    conv1 = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
    bias1 = tf.nn.relu(conv1 + layer1_biases)
    pool1 = tf.nn.max_pool(bias1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.conv2d(pool1, layer2_weights, [1, 1, 1, 1], padding='SAME')
    bias2 = tf.nn.relu(conv2 + layer2_biases)
    pool2 = tf.nn.max_pool(bias2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    # Flatten and finish with the same fully-connected layers as before.
    shape = pool2.get_shape().as_list()
    reshape = tf.reshape(pool2, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
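
As a quick sanity check (assuming the assignment's 28x28 input, which isn't stated above), both versions reach the same spatial resolution before the fully-connected layers, since two stride-2 convolutions and two 2x2/stride-2 poolings each downsample by a factor of 4:

  import math

  size = 28                       # assumed input width/height
  for _ in range(2):              # two stride-2 steps (strided conv or max pool)
      size = math.ceil(size / 2)  # with SAME padding, output = ceil(input / stride)
  print(size)                     # 7 in both models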

What would be a compelling reason to use the latter model instead of the no-pool model, besides the improved accuracy? I would love to hear some insights from people who have already used CNNs many times!

aerin
  • In your example, don't you want to return `tf.matmul(pool2, ...` otherwise the network does not contain the last pooling layer, right? – Soerendip Dec 14 '17 at 13:10

2 Answers


Both of the approaches (strides and pooling) reduce the dimensionality of the input (for strides/pooling size > 1). This by itself is a good thing, because it reduces computation time and the number of downstream parameters, and helps prevent overfitting.

They achieve it in different ways:

  1. A strided convolution folds the downsampling into the learned kernel: how the input is reduced is determined by weights that are trained.
  2. Pooling is a fixed operation with no parameters to learn: max pooling, for example, simply keeps the strongest activation in each window.

You also mentioned "besides the improved accuracy". But almost everything people do in machine learning is aimed at improving the accuracy (or some other loss function). So if tomorrow someone shows that sum-square-root pooling achieves the best results on many benchmarks, a lot of people will start to use it.
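
As a rough illustration (the shapes here are my own assumptions, not from the question), the strided-convolution route folds the downsampling into the learned kernel, while the conv + max-pool route leaves the downsampling step with zero trainable parameters:

  import tensorflow as tf

  x = tf.placeholder(tf.float32, [16, 28, 28, 1])               # assumed batch of 28x28 images
  w = tf.Variable(tf.truncated_normal([5, 5, 1, 8], stddev=0.1))

  # Route 1: strided convolution -- the downsampling is learned (it lives in w).
  strided = tf.nn.conv2d(x, w, [1, 2, 2, 1], padding='SAME')    # -> [16, 14, 14, 8]

  # Route 2: stride-1 convolution followed by max pooling -- the pooling step has
  # no trainable parameters; it just keeps the strongest activation per 2x2 window.
  conv = tf.nn.conv2d(x, w, [1, 1, 1, 1], padding='SAME')       # -> [16, 28, 28, 8]
  pooled = tf.nn.max_pool(conv, [1, 2, 2, 1], [1, 2, 2, 1],
                          padding='SAME')                       # -> [16, 14, 14, 8]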

Salvador Dali
  • Thank you Salvador! We are lucky to have you! – aerin Jun 26 '17 at 01:42
  • "Both of the approaches (strides and pooling) reduces the dimensionality" - wrong. Only the strides > 1 reduce the dimensionality. You can have pooling with stride = 1. (Otherwise a good answer) – Martin Thoma Jun 26 '17 at 04:24
  • @MartinThoma this is kind of nitpicking. In the same way you could say that pooling does not reduce the dimensionality, because technically you can use pooling of 1 element (which does nothing) and it also does not reduce dims. Anyway, added a comment about > 1. – Salvador Dali Jun 26 '17 at 04:32
  • "The same way you can say that pooling does not reduce the dimensionality." - that is exactly what I would say. I don't think it is nitpicking. It is an important information which might not be clear to people who are new to neural networks and it is easy to write / say. +1 – Martin Thoma Jun 26 '17 at 04:50

In a classification task, improving the accuracy is the goal.

However, pooling allows you to:

  1. Reduce the input dimensionality
  2. Force the network to learn particular features, depending on the type of pooling you apply.

Reducing the input dimensionality is something you want because it forces the network to project its learned representations into a different, lower-dimensional space. This is good computationally speaking, because you have to allocate less memory and thus you can use bigger batches. But it is also desirable because high-dimensional spaces usually contain a lot of redundancy, and are spaces in which all objects appear sparse and dissimilar (see The curse of dimensionality).
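
To make the saving concrete (the 28x28x16 figures below are an assumption for illustration, not taken from the question), here is the size of what the fully-connected layer would have to consume with and without two 2x2 poolings:

  h, w, depth = 28, 28, 16                   # assumed feature-map size and depth
  flat_full = h * w * depth                  # 12544 values per example at full resolution
  flat_pooled = (h // 4) * (w // 4) * depth  # 784 values after two 2x2 poolings
  print(flat_full, flat_pooled)              # 12544 784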

The function you decide to use for the pooling operation, moreover, can force the network to give more importance to some features.

Max-pooling, for instance, is widely used because it allows the network to be robust to small variations of the input image.

What happens in practice is that only the features with the highest activations pass through the max-pooling gate. If the input image is shifted by a small amount, the max-pooling op still produces roughly the same output even though the input is shifted (the maximum shift tolerated is thus bounded by the kernel size).
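
A tiny 1-D toy example (my own construction, not from a real network) of what "only the highest activations pass through" means under a one-element shift:

  import numpy as np

  x = np.array([0.1, 0.9, 0.2, 0.1])                   # strong activation at index 1
  x_shifted = np.array([0.9, 0.2, 0.1, 0.1])           # same activation, shifted left by one

  pool = x.reshape(2, 2).max(axis=1)                   # kernel 2, stride 2 -> [0.9, 0.2]
  pool_shifted = x_shifted.reshape(2, 2).max(axis=1)   # -> [0.9, 0.1]

  print(pool[0] == pool_shifted[0])                    # True: the dominant feature survives the shift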

CNNs without pooling are also capable of learning this kind of feature, but at a bigger cost in terms of parameters and computing time (see Striving for Simplicity: The All Convolutional Net).

nessuno
  • Doesn't the paper 'Striving for Simplicity' mean to imply that a CNN without pooling is actually cheaper in terms of cost and computing time without much loss in accuracy? In the paper the conv+pooling is replaced with a conv with stride. – Vijay Mariappan Jun 26 '17 at 16:23
  • Yes, they replaced an operation that has 0 parameters to learn with an operation that has `kernel_width² x kernel_depth x number_of_kernels` parameters to learn. – nessuno Jun 26 '17 at 16:45
  • They don't replace the pooling layer with conv. They replace the conv+pooling (together) with conv_with_stride. – Vijay Mariappan Jun 26 '17 at 16:52
  • Are you sure? Read the numbered list on page 2: it says they could choose either to remove the layer, as you told me, or to replace it with a convolution (points 1 and 2). Then they say that the first option has some downsides. Scroll down to page 5 and look at table 3: models A, B and C increase the number of parameters while becoming "all convolutional". – nessuno Jun 26 '17 at 17:02
  • Yes, you are right, it looks like they have tried both options. And the best results they get are from replacing pooling with cnn_with_stride. – Vijay Mariappan Jun 26 '17 at 17:18