17

I am struggling with the concept of attention in the context of autoencoders. I believe I understand how attention is used in seq2seq translation: after training the combined encoder and decoder, we can use both the encoder and the decoder to create (for example) a language translator. Because we are still using the decoder in production, we can take advantage of the attention mechanism.

However, what if the main goal of the autoencoder is to produce a compressed latent representation of the input? I am talking about cases where we can essentially dispose of the decoder part of the model after training.

For example, if I use an LSTM without attention, the "classic" approach is to use the last hidden state as the context vector; it should represent the main features of my input sequence. If I were to use an LSTM with attention, my latent representation would have to be all hidden states per time step. This doesn't seem to fit the notion of compressing the input and keeping its main features; the dimensionality is likely to be significantly higher.
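
For concreteness, here is a minimal sketch of the two setups I have in mind (assuming a PyTorch-style LSTM encoder; shapes and names are just illustrative):

```python
import torch
import torch.nn as nn

seq_len, batch, d_in, d_hid = 20, 8, 32, 64
x = torch.randn(seq_len, batch, d_in)

encoder = nn.LSTM(d_in, d_hid)
outputs, (h_n, c_n) = encoder(x)

# Without attention: the last hidden state is the entire latent code.
latent_no_attention = h_n[-1]       # shape: (batch, d_hid)

# With attention: the decoder attends over every per-step hidden state,
# so the "latent" representation grows with the sequence length.
latent_with_attention = outputs     # shape: (seq_len, batch, d_hid)
```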

Additionally, if I need to use all hidden states as my latent representation (as in the attention case), why use attention at all? I could just use all hidden states to initialize the decoder.

user3641187
  • 405
  • 5
  • 10
  • you can add a small feed-forward layer after the big hidden states to reduce the dimension – Hai Feng Kao Sep 03 '20 at 18:07
  • Yes, but that seems to defeat the entire point of attention to begin with. Attention is about knowing which hidden states are relevant given the context; adding a linear layer would perform a static choice of importance. And given the recursive nature of an LSTM, the first hidden layer should be optimal for the recursion during decoding. So why even use attention to begin with? – user3641187 Sep 08 '20 at 13:45

2 Answers

1

The answer depends very much on what you aim to use the representation from the autoencoder for. Every autoencoder needs something that makes the autoencoding task hard, so that it needs a rich intermediate representation to solve the task. That something can be either a bottleneck in the architecture (as in the vanilla encoder-decoder model) or noise added on the source side (you can view BERT as a special case of a denoising autoencoder in which some input tokens are masked).
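
As a rough illustration of the denoising idea (BERT's actual masking scheme is more involved; the mask id and probability below are only placeholders):

```python
import random

MASK_ID = 0  # placeholder id for a [MASK]-style token; illustrative only

def corrupt(token_ids, mask_prob=0.15):
    """Randomly mask tokens so the model cannot solve the task by copying."""
    return [MASK_ID if random.random() < mask_prob else t for t in token_ids]

# The denoising objective is to reconstruct the original token_ids
# from the corrupted sequence returned by corrupt().
```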

If you do not introduce any noise on the source side, the autoencoder learns to simply copy the input without learning anything beyond the identity of the input/output symbols; attention would break the bottleneck property of the vanilla model. The same also holds for the case of labeling the encoder states.

There are sequence-to-sequence autoencoders (BART, MASS) that use encoder-decoder attention. The noise they introduce includes masking and randomly permuting tokens. The representations they learn are then better suited to sequence-to-sequence tasks (such as text summarization or low-resource machine translation) than representations from encoder-only models such as BERT.
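
In the same spirit, a BART/MASS-style corruption might combine masking with a random permutation; the sketch below is only meant to convey the idea, not the exact noise functions used by those models:

```python
import random

MASK_ID = 0  # placeholder mask token id

def seq2seq_noise(token_ids, mask_prob=0.3):
    """Mask some tokens and shuffle the result; the decoder must then
    reconstruct the original, ordered sequence."""
    noisy = [MASK_ID if random.random() < mask_prob else t for t in token_ids]
    random.shuffle(noisy)
    return noisy
```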

Jindřich
  • 10,270
  • 2
  • 23
  • 44
0

"Attention is proposed as a solution to the limitation of the Encoder-Decoder model encoding the input sequence to one fixed length vector from which to decode each output time step. This issue is believed to be more of a problem when decoding long sequences"

https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/

It is simply a means of improving on the "without attention" architecture when working with long sequences, where the compressed representation can become insufficient.
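
To make that concrete, here is a minimal sketch of dot-product attention over the encoder's hidden states (PyTorch-style; shapes and names are illustrative, and this is not the exact formulation from the linked article):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """Dot-product attention: one context vector per decoding step."""
    # decoder_state: (batch, d_hid); encoder_states: (seq_len, batch, d_hid)
    scores = torch.einsum("bd,sbd->sb", decoder_state, encoder_states)
    weights = F.softmax(scores, dim=0)               # one weight per time step
    context = torch.einsum("sb,sbd->bd", weights, encoder_states)
    return context, weights

# Example: a long input sequence of 100 steps.
encoder_states = torch.randn(100, 8, 64)
decoder_state = torch.randn(8, 64)
context, weights = attend(decoder_state, encoder_states)  # context: (8, 64)
```

The decoder gets a fresh weighted summary of all encoder states at every step, instead of relying on a single fixed-length vector.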

"If I were to use an LSTM with attention, my latent representation would have to be all hidden states per time step. This doesn't seem to fit the notion of compressing the input and keeping its main features."

An undercomplete latent representation is one way of regularizing an autoencoder to force it to extract relevant features, but it is not a necessary condition. Overcomplete autoencoders (with a higher-dimensional latent representation plus regularization) can also learn relevant features.
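
For illustration, a quick sketch of an overcomplete autoencoder kept in check by an L1 sparsity penalty (dimensions and the penalty weight are arbitrary; this only shows the idea, not a recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_code = 32, 128   # code is larger than the input: overcomplete

encoder = nn.Sequential(nn.Linear(d_in, d_code), nn.ReLU())
decoder = nn.Linear(d_code, d_in)

x = torch.randn(64, d_in)
code = encoder(x)
recon = decoder(code)

# Reconstruction loss plus an L1 sparsity penalty on the code; without
# some regularization an overcomplete model could just learn the identity.
loss = F.mse_loss(recon, x) + 1e-3 * code.abs().mean()
```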

If you want to know more, you can read Deep Learning (Ian Goodfellow et al.), Chapter 14.

Yoan B. M.Sc
  • 1,485
  • 5
  • 18