I am struggling with the concept of attention in the context of autoencoders. I believe I understand how attention is used in seq2seq translation: after training the combined encoder and decoder, we use both of them to build (for example) a language translator. Because the decoder is still used in production, we can take advantage of the attention mechanism.
However, what if the goal of the autoencoder is mainly to produce a compressed latent representation of the input vector? I am talking about cases where we can essentially discard the decoder part of the model after training.
For example, if I use an LSTM without attention, the "classic" approach is to use the last hidden state as the context vector - it should represent the main features of my input sequence. If I were to use an LSTM with attention, my latent representation would have to be all of the hidden states, one per time step. This doesn't seem to fit the notion of compressing the input and keeping only its main features; the dimensionality is likely to be significantly higher.
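To make the dimensionality point concrete, here is a minimal sketch of what I mean (PyTorch; the class and variable names are just mine for illustration, not from any particular implementation):

```python
import torch
import torch.nn as nn


class LSTMEncoder(nn.Module):
    """Encoder illustrating the two choices of latent representation:
    the last hidden state ("classic", no attention) vs. all hidden states
    (what an attention-based decoder would need)."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, latent_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        all_states, (h_n, _) = self.lstm(x)
        # Without attention: the latent code is just the last hidden state,
        # shape (batch, latent_dim).
        latent = h_n[-1]
        # With attention: the decoder needs every per-time-step hidden state,
        # shape (batch, seq_len, latent_dim).
        return latent, all_states


# A sequence of 50 time steps with 10 features compresses to a
# 16-dimensional vector without attention, but with attention the
# "latent" would be 50 x 16 values.
encoder = LSTMEncoder(input_dim=10, latent_dim=16)
x = torch.randn(4, 50, 10)
latent, all_states = encoder(x)
print(latent.shape)      # torch.Size([4, 16])
print(all_states.shape)  # torch.Size([4, 50, 16])
```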
Additionally, if I have to keep all of the hidden states as my latent representation anyway (as in the attention case), why use attention at all? I could just use all of the hidden states to initialize the decoder.