
So I've used RNN/LSTMs in three different capacities:

  1. Many to many: Use the final layer's output at every time step to predict the next element. Could be classification or regression.
  2. Many to one: Use the final hidden state to perform regression or classification.
  3. One to many: Take a latent-space vector, perhaps the final hidden state of an LSTM encoder, and use it to generate a sequence (I've done this in the form of an autoencoder).

In none of these cases do I use the intermediate layers' hidden states to generate my final output: only the last layer's outputs in case #1, and only the last layer's final hidden state in cases #2 and #3. However, PyTorch's nn.LSTM/RNN returns a tensor containing the final hidden state of every layer, so I assume those states have some uses.
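
For concreteness, here's a minimal sketch of what I mean (toy sizes, default `batch_first=False`):

```python
import torch
import torch.nn as nn

# Toy dimensions, purely for illustration
T, B, input_size, hidden_size, num_layers = 7, 3, 10, 16, 4

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
x = torch.randn(T, B, input_size)

output, (h_n, c_n) = lstm(x)

print(output.shape)  # (T, B, hidden_size): last layer's output at every time step
print(h_n.shape)     # (num_layers, B, hidden_size): final hidden state of every layer
print(c_n.shape)     # (num_layers, B, hidden_size): final cell state of every layer
```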

What are some use cases for those intermediate layer states?

rocksNwaves

1 Answer


There’s nothing explicitly requiring you to use only the last layer. You could feed all of the layers' states into your final classifier MLP at each position in the sequence (or at the end, if you’re classifying the whole sequence).
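
As a rough sketch (the module name and sizes here are made up purely for illustration), whole-sequence classification using the final hidden state of every layer could look something like this:

```python
import torch
import torch.nn as nn

class AllLayersClassifier(nn.Module):
    """Hypothetical sequence classifier that uses h_n from *every* LSTM layer."""
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
        # One feature vector per layer, concatenated before the classifier MLP.
        self.mlp = nn.Sequential(
            nn.Linear(num_layers * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):               # x: (T, B, input_size)
        _, (h_n, _) = self.lstm(x)      # h_n: (num_layers, B, hidden_size)
        features = h_n.permute(1, 0, 2).reshape(x.size(1), -1)  # (B, num_layers * hidden_size)
        return self.mlp(features)       # (B, num_classes)
```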

As a practical example, consider the ELMo architecture for generating contextualized (that is, token-level) word embeddings. (Paper here: https://www.aclweb.org/anthology/N18-1202/) The representations are the hidden states of a multi-layer biRNN. Figure 2 in the paper shows how different layers differ in usefulness depending on the task. The authors suggest that lower levels encode syntax, while higher levels encode semantics.
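
The way ELMo combines the layers is essentially a learned, softmax-normalized weighted sum; a stripped-down sketch of that idea (not the authors' code, names invented here) might look like:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Simplified ELMo-style mixing: a learned weighted sum over layer representations."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))   # one scalar weight per layer
        self.gamma = nn.Parameter(torch.ones(1))                # overall task-specific scale

    def forward(self, layer_reps):
        # layer_reps: (num_layers, T, B, hidden_size), per-layer hidden states
        w = torch.softmax(self.weights, dim=0)                  # normalize across layers
        mixed = (w.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)   # (T, B, hidden_size)
        return self.gamma * mixed
```

A downstream task then learns which layers matter most for it, which is exactly the behavior the figure in the paper illustrates.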

Arya McCarthy
  • I mean layer in the sense of stacked RNNs. For input of length `T`, an RNN with `N` layers gives you two outputs: first, the `output` of layer `N` at each timestep `t` in `[1, T]`; second, `h_n`, which is the vector `[h_T_1, ..., h_T_N]`, i.e. the final hidden state of each layer from 1 to N. I believe that is the convention, as illustrated well by this visualization of the PyTorch documentation (https://stackoverflow.com/a/48305882/3696204). So what I'm asking about is specifically `h_n = [h_T_1, ... , h_T_N]`. Could you edit your answer so that I know we are on the same page? – rocksNwaves Feb 28 '21 at 14:46
  • Yeah, that’s the natural way to use the term. My remarks about the popular word embedding method ELMo apply in this case. – Arya McCarthy Feb 28 '21 at 14:49
  • Okay so your answer stands? I just wanted to make sure we were speaking the same language before I continued. – rocksNwaves Feb 28 '21 at 14:50
  • Yup! We are. I removed the irrelevant second half of the answer. – Arya McCarthy Feb 28 '21 at 14:53
  • Awesome, thanks Arya. I've added the paper you linked to my growing bookmarks folder of "papers to read". – rocksNwaves Feb 28 '21 at 14:54