I am going through a series of machine learning examples that use RNNs for document classification (many-to-one). In most tutorials, the RNN output of the last time step is used, i.e., it is fed into one or more dense layers that map it to the number of classes (e.g., [1], [2]).
However, I also came across some examples where, instead of the last output, the average of the outputs over all time steps is used (mean pooling?, e.g., [3]). This averaged output of course has the same dimensions as the last output, so computationally both approaches work the same way (see the sketch below).
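For concreteness, here is a minimal sketch of the two readouts, written in PyTorch as an assumption on my part (it is not taken from the linked tutorials; the layer sizes and number of classes are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical setup: a single LSTM followed by one dense classification head.
rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 5)  # 5 classes, chosen arbitrarily for illustration

x = torch.randn(8, 100, 32)   # (batch, time, features)
outputs, _ = rnn(x)           # (batch, time, hidden)

last = outputs[:, -1, :]      # readout 1: last time step -> (batch, hidden)
mean = outputs.mean(dim=1)    # readout 2: mean over time  -> (batch, hidden)

logits_last = head(last)      # (batch, num_classes)
logits_mean = head(mean)      # identical shape, so the same head works for both
```

Either readout yields a `(batch, hidden)` tensor, so the classification head downstream is unchanged; only the way the per-time-step outputs are summarized differs.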
My question now is: what is the intuition behind these two different approaches? Due to the recurrent nature of the RNN, the last output already reflects the outputs of all previous time steps. So what is the idea behind averaging the RNN outputs over all time steps, and when should one use which?