
From the PyTorch Seq2Seq tutorial (http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder), we see that the attention mechanism relies heavily on the MAX_LENGTH parameter to determine the output dimensions of attn -> attn_softmax -> attn_weights, i.e.

import torch.nn as nn

MAX_LENGTH = 10  # maximum sentence length, as defined earlier in the tutorial

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        # attention scores: one per (padded) source position, hence max_length outputs
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

More specifically:

self.attn = nn.Linear(self.hidden_size * 2, self.max_length)

I understand that the MAX_LENGTH variable is the mechanism used to limit the number of parameters that need to be trained in the AttnDecoderRNN.

If we don't have a pre-determined MAX_LENGTH, what value should we initialize the attn layer with?

Would it be the output_size? If so, that would mean learning attention with respect to the full vocabulary of the target language. Isn't that the real intention of the Bahdanau et al. (2015) attention paper?
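
For context, the tutorial's forward pass multiplies these attention weights against an encoder-output buffer that is padded to max_length. Below is a minimal, self-contained sketch of that step; hidden_size, the random tensors, and the zero-filled encoder_outputs are illustrative stand-ins, not the tutorial's exact code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    hidden_size, max_length = 256, 10   # illustrative values

    attn = nn.Linear(hidden_size * 2, max_length)

    embedded = torch.randn(1, hidden_size)                  # current target-word embedding
    hidden = torch.randn(1, hidden_size)                    # previous decoder hidden state
    encoder_outputs = torch.zeros(max_length, hidden_size)  # encoder states, zero-padded to max_length

    # one attention weight per (padded) source position -> shape [1, max_length]
    attn_weights = F.softmax(attn(torch.cat((embedded, hidden), 1)), dim=1)

    # weighted sum over the encoder outputs -> context vector of shape [1, 1, hidden_size]
    attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

This is why the attn layer's output dimension has to match however many encoder positions the weights are applied to.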

alvas
  • Also asked on https://discuss.pytorch.org/t/attentiondecoderrnn-without-max-length/13473 – alvas Feb 09 '18 at 04:06
  • Did you consider local attention instead of global? – Maxim Feb 14 '18 at 10:35
  • Not yet, but if it's global, there must be some sort of a max, right? It's just for tractability, no? Theoretically, it can do attention from all source words to all target words. It's just that if max_length = no. of target words, then for a given sentence pair any positions that don't exist in the source will have zeros. – alvas Feb 15 '18 at 02:21
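
For reference, the padding described in the last comment works roughly like this in the tutorial's training loop (names and sizes below are illustrative):

    import torch

    MAX_LENGTH, hidden_size = 10, 256   # illustrative values
    input_length = 7                    # actual source-sentence length for this pair

    encoder_states = torch.randn(input_length, hidden_size)   # stand-in for real encoder outputs

    encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
    encoder_outputs[:input_length] = encoder_states
    # positions input_length .. MAX_LENGTH-1 stay zero, i.e. source positions that
    # don't exist contribute zero vectors to the attention-weighted sum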

1 Answer


Attention modulates the input to the decoder. That is, attention modulates the encoded sequence, which has the same length as the input sequence. Thus, MAX_LENGTH should be the maximum sequence length over all your input sequences.
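
Note that the attention in Bahdanau et al. (2015) scores each actual encoder output against the current decoder state, so its parameter shapes do not depend on a maximum length; only the tutorial's simplified variant does. A minimal sketch of such an additive attention module (illustrative, not the tutorial's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdditiveAttention(nn.Module):
        """Bahdanau-style additive attention; works for any source length."""
        def __init__(self, hidden_size):
            super(AdditiveAttention, self).__init__()
            self.W_dec = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_enc = nn.Linear(hidden_size, hidden_size, bias=False)
            self.v = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, decoder_hidden, encoder_outputs):
            # decoder_hidden: [1, hidden_size], encoder_outputs: [src_len, hidden_size]
            scores = self.v(torch.tanh(self.W_dec(decoder_hidden) + self.W_enc(encoder_outputs)))
            weights = F.softmax(scores.squeeze(1), dim=0)        # [src_len], no MAX_LENGTH anywhere
            context = weights.unsqueeze(0) @ encoder_outputs     # [1, hidden_size]
            return context, weights

    # usage: the same module handles source sentences of any length
    attn = AdditiveAttention(256)
    context, weights = attn(torch.randn(1, 256), torch.randn(7, 256))

Here the learned parameters depend only on hidden_size, so no value has to be picked for the attn layer's output dimension, and certainly not output_size.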

patapouf_ai