In the Stack Overflow thread "How can i add a Bi-LSTM layer on top of bert model?", there is this line of code:
hidden = torch.cat((lstm_output[:, -1, :256], lstm_output[:, 0, 256:]), dim=-1)
Can someone explain why the concatenation uses the last and first timesteps rather than any others? What would these two positions contain that made them the ones to choose?