5

I know that a typical CNN consists of both convolutional and pooling layers. Pooling layers make the output smaller, which means less computation, and they also make the network somewhat translation invariant, so the position of the feature detected by the kernel can shift a little in the original image.

But what happens when I don't use pooling layers? The reason could be that I want a feature vector for each pixel of the original image, so the output of the convolutional layers has to have the same spatial size as the image, just with more channels. Does this make sense? Will these feature vectors still contain useful information, or are pooling layers necessary in CNNs? Or are there approaches to get feature vectors for individual pixels while still using pooling layers?
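To make it concrete, by "the same size as the image" I mean something along these lines (a rough Keras sketch; the number of layers and filters is just for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# No pooling and no striding: with padding='same' every layer keeps the
# 256x256 spatial size, so the last layer yields an N-dimensional
# feature vector for every pixel (here N = 64).
model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
])
print(model.output_shape)  # (None, 256, 256, 64)
```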

T.Poe
  • 1,949
  • 6
  • 28
  • 59

2 Answers

1

Convolutional feature maps, both early and late ones, contain a lot of useful information. Many interesting and fun applications are based precisely on feature maps from pre-trained CNNs, e.g. Google Deep Dream and Neural Style. A common choice of pre-trained model is VGGNet, because of its simplicity.

Also note that some CNNs, e.g. the All Convolutional Net, replace pooling layers with convolutional ones. They still downsample through striding, but completely avoid max-pool or avg-pool operations. This idea has become popular and is applied in many modern CNN architectures.
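For example, a minimal Keras sketch of that idea (filter counts and sizes here are my own illustration, not the paper's exact configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Downsampling is done by strided convolutions instead of pooling layers.
model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    # replaces a 2x2 max-pool: 256x256 -> 128x128
    layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    # replaces the next pooling stage: 128x128 -> 64x64
    layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
])
model.summary()
```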

The only difficulty is that a CNN without downsampling may be harder to train. You need enough training data where the labels are images (I assume you have that), and you'd also need a suitable loss function for backpropagation. Of course, you can start with the L2 norm of the pixel difference, but it really depends on the problem you're solving.
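For instance, a plain pixel-wise L2 loss could look like this (a sketch in TensorFlow; whether it fits depends entirely on your task):

```python
import tensorflow as tf

def l2_pixel_loss(y_true, y_pred):
    # Mean squared difference over all pixels and channels.
    return tf.reduce_mean(tf.square(y_true - y_pred))
```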

My recommendation would be to take an existing pre-trained CNN (e.g. VGGNet for TensorFlow) and keep just the first two convolutional layers, up until the first downsampling. This is a fast way to try out this kind of architecture.
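For example, with Keras this could look roughly like the following (a sketch; `block1_conv1` and `block1_conv2` are the standard layer names in the Keras VGG16 model, and they keep the input's spatial size because VGG uses 'same' padding before the first max-pool):

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import Model

# VGG16 pre-trained on ImageNet, without the classifier head.
base = VGG16(weights='imagenet', include_top=False, input_shape=(256, 256, 3))

# Keep only the layers up to (and including) block1_conv2, i.e. everything
# before the first max-pooling layer.
features = Model(inputs=base.input,
                 outputs=base.get_layer('block1_conv2').output)

# A 256x256x3 image gives a 256x256x64 feature map: 64 features per pixel.
print(features.output_shape)  # (None, 256, 256, 64)
```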

Maxim
  • 52,561
  • 27
  • 155
  • 209
  • These are CNNs with no pooling layers but still some downsampling. What if I didn't use any downsampling at all? So if the input is a 256x256x3 image, the output would be 256x256xN (a feature vector of size N for each pixel). Would it still carry the information and be transform invariant? My goal is to detect a certain pixel in the image, with a specific neighborhood, using this feature vector. I don't want to use a traditional sliding-window method because the texture of the image as a whole points to this pixel, and I think that information would be lost with just a sliding window. – T.Poe Nov 01 '17 at 19:07
  • The early conv layers probably won't be transform invariant, that's true, but they still contain useful information - that's my point. It's hard to say in advance whether early feature maps will work for *your data* and *your task* or not. But if not, you can also consider downsampling + upsampling with convolution to get a feature map of the size you need. – Maxim Nov 02 '17 at 16:32
  • It's maybe a question for another post, but do you know of any models with more than two layers before the first downsampling (already trained, like the VGGNet you posted in the answer)? @Maxim – T.Poe Nov 06 '17 at 12:00
  • All well-known networks that I know of use 1-2 layers before the first downsampling, I think because their main goal is classification accuracy and 2 layers are flexible enough: more flexibility there doesn't lead to higher accuracy. But they often use more layers later in the network. – Maxim Nov 06 '17 at 12:32
0

Q1: I want to have features for each pixel in the image.

Answer: Image pixels are generally highly correlated, and it is almost never useful to have a feature for every pixel. One way to see pooling is that, besides compensating for small jitter, it picks only one representative from a highly correlated spatial neighborhood. That being said, pooling is not strictly necessary.

Q2: Will there be useful information in these feature vectors?

Answer: We generally make the height and width of the feature maps smaller down the layers and increase the number of channels. The reasoning is that the larger the height x width of a layer's output, the more explicit the spatial information it carries. Making the output of each layer deeper in channels but smaller in spatial size lets it drop spatial information and encode content more semantically, independent of position: identifying whether an image contains a cat should not depend on where the cat is or how big it is, but on the presence of cat features such as the pattern of the ears, color patterns, fur, etc.

But you can do this same sampling of less correlated pixels with a convolution of stride > 1.

The main idea is to let the same information that is contained in the image flow through the network, but with smaller, deeper representations.

Also, the application makes a lot of difference: denoising, for example, requires you to reconstruct the image exactly, so you are limited in how small the representations can get, because you have to reconstruct from them.

Classification flattens the feature map, so there are separate considerations there.

Also note that the amount of information in a feature map throughout the CNN can be preserved: its size is H x W x channels, and you can keep that product the same while decreasing some factors and increasing others. For example, halving H and W while quadrupling the channels (256x256x16 -> 128x128x64) keeps H x W x channels constant.

The last point, getting feature vectors for individual pixels, does not make much sense to me. Pixels are themselves low-level features, and in general we are interested in high-level features of a collection of pixels. If you want to preserve pixel-level features, you might as well just use the pixel values. Or am I missing something?