
I am going through a series of machine learning examples that use RNNs for document classification (many-to-one). In most tutorials, the RNN output of the last time step is used, i.e., it is fed into one or more dense layers to map it to the number of classes (e.g., [1], [2]).

However, I also came across some examples where, instead of the last output, the average of the outputs over all time steps is used (mean pooling?, e.g., [3]). The dimensions of this averaged output are of course the same as those of the last output, so computationally both approaches plug into the same downstream layers (see the sketch below).
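To make it concrete, here is a minimal sketch of the two variants as I understand them (PyTorch, with made-up layer sizes); only the read-out of the RNN outputs differs:

```python
import torch
import torch.nn as nn

batch, seq_len, emb_dim, hidden, n_classes = 4, 20, 50, 64, 3
x = torch.randn(batch, seq_len, emb_dim)   # already-embedded token sequence

rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
fc = nn.Linear(hidden, n_classes)

outputs, _ = rnn(x)                # (batch, seq_len, hidden): one output per time step

last_out = outputs[:, -1, :]       # variant 1: output of the last time step
mean_out = outputs.mean(dim=1)     # variant 2: average over all time steps (mean pooling)

# Both read-outs have shape (batch, hidden), so the same dense head fits either one.
logits_last = fc(last_out)
logits_mean = fc(mean_out)
```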

My question now is: what is the intuition behind the two different approaches? Due to its recursive nature, the last output already reflects the outputs of the previous time steps. So why average the RNN outputs over all time steps at all? When should which approach be used?

Christian

1 Answer


Pooling over time is a specific technique used to extract features from the input sequence. From this question:

The reason to do this, instead of "down-sampling" the sentence like in a CNN, is that in NLP sentences naturally have different lengths in a corpus. This makes the feature maps different for different sentences, but we'd like to reduce the tensor to a fixed size to apply a softmax or regression head in the end. As stated in the paper, it allows capturing the most important feature, the one with the highest value for each feature map.

It's important to note here that max-over-time (or average-over-time) pooling is usually an intermediate layer. In particular, there can be several of them in a row or in parallel (with different window sizes), as in the sketch below. The end result produced by the network can still be either many-to-one or many-to-many (at least in theory).
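A rough sketch of that kind of setup (PyTorch, with invented layer sizes, loosely following the text-CNN described in the quote): several convolutions with different window sizes run in parallel, each followed by max-over-time pooling, and the concatenated result has a fixed size regardless of the sentence length.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, emb_dim=50, n_filters=32, windows=(3, 4, 5), n_classes=3):
        super().__init__()
        # one convolution per window size, applied in parallel
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in windows]
        )
        self.fc = nn.Linear(n_filters * len(windows), n_classes)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim), seq_len may vary
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, seq_len)
        # max over the time dimension gives one value per filter, per window size
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))   # fixed-size vector regardless of seq_len

model = TextCNN()
print(model(torch.randn(4, 20, 50)).shape)        # torch.Size([4, 3])
```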

However, in most cases there is a single output from the RNN. If the output must be a sequence, this output is usually fed into another RNN. So it all boils down to how exactly this single vector is learned: take the last cell output, aggregate across the whole sequence, apply an attention mechanism, etc.
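For illustration, a small sketch of these read-out options (PyTorch; the shapes and the tiny attention head are made up here, not taken from any particular tutorial):

```python
import torch
import torch.nn as nn

batch, seq_len, emb_dim, hidden = 4, 20, 50, 64
x = torch.randn(batch, seq_len, emb_dim)
outputs, _ = nn.LSTM(emb_dim, hidden, batch_first=True)(x)   # (batch, seq_len, hidden)

last = outputs[:, -1, :]                 # 1) last cell output
mean = outputs.mean(dim=1)               # 2a) aggregate: average over time
maxed = outputs.max(dim=1).values        # 2b) aggregate: max over time

# 3) a very small attention mechanism: learned weights over the time steps
scores = nn.Linear(hidden, 1)(outputs)   # (batch, seq_len, 1) unnormalized scores
weights = torch.softmax(scores, dim=1)   # weights sum to 1 over the time dimension
attended = (weights * outputs).sum(dim=1)  # (batch, hidden) weighted average
```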

Maxim
  • I get that reducing sequences of variable length to a fixed-size tensor is mostly what's needed, and that this can be done by using the last cell output or an aggregate over the whole output sequence. But what is the difference, intuitively speaking? Are there cases where one approach is preferable to the other? – Christian May 10 '18 at 13:27
  • @Christian This is a general question; it can also be stated like this: when is a CNN better than an RNN? I'd say that for tasks where feature detection in text is more important (for example, searching for angry terms, sadness, abuse, named entities, etc.) CNNs work better, whereas RNNs fit better when the information is spread across the whole sequence (translation). I've recently tried both for a programming language detector. Surprisingly, the CNN turned out to be better. https://github.com/maxim5/code-inspector – Maxim May 10 '18 at 13:35
  • Since CNNs and RNNs are rather different network structures, I can kind of see in which cases which network might be more suitable than the other (when it comes to text). When it comes to the ways of reducing a variable-length RNN output to a fixed tensor, making a choice seems much less obvious to me. My current take-away message is that there is a lot of trial and error involved. Right? Not very satisfying :). – Christian May 10 '18 at 14:03
  • I see your point. If you are interested specifically in output reduction, I can say that averaging the cell outputs is much rarer, especially in NLP, because the key term can be anywhere in the sequence, so the cells that haven't seen it are practically irrelevant. Regarding your last point: I'm afraid that's mostly how it is (it was even formalized in the *no free lunch theorem*), and often the data defines which approach is going to work better. – Maxim May 10 '18 at 15:03