
From the PyTorch Seq2Seq tutorial (http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder), we see that the attention mechanism relies heavily on the MAX_LENGTH parameter to determine the output dimensions of attn -> attn_softmax -> attn_weights, i.e.

import torch.nn as nn

MAX_LENGTH = 10  # maximum sentence length, as defined earlier in the tutorial

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        # attention scores: one per (padded) source position, hence max_length outputs
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

More specifically:

self.attn = nn.Linear(self.hidden_size * 2, self.max_length)

I understand that the MAX_LENGTH variable is the mechanism used to limit the number of parameters that need to be trained in the AttnDecoderRNN.

If we don't have a pre-determined MAX_LENGTH, what value should we initialize the attn layer with?

Would it be the output_size? If so, that would mean learning attention with respect to the full vocabulary of the target language. Isn't that the real intention of the Bahdanau et al. (2015) attention paper?
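
For context, the tutorial's forward pass multiplies these attention weights against an encoder-output buffer that is padded to max_length. Below is a minimal, self-contained sketch of that step; hidden_size, the random tensors, and the zero-filled encoder_outputs are illustrative stand-ins, not the tutorial's exact code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    hidden_size, max_length = 256, 10   # illustrative values

    attn = nn.Linear(hidden_size * 2, max_length)

    embedded = torch.randn(1, hidden_size)                  # current target-word embedding
    hidden = torch.randn(1, hidden_size)                    # previous decoder hidden state
    encoder_outputs = torch.zeros(max_length, hidden_size)  # encoder states, zero-padded to max_length

    # one attention weight per (padded) source position -> shape [1, max_length]
    attn_weights = F.softmax(attn(torch.cat((embedded, hidden), 1)), dim=1)

    # weighted sum over the encoder outputs -> context vector of shape [1, 1, hidden_size]
    attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

This is why the attn layer's output dimension has to match however many encoder positions the weights are applied to.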

alvas
  • Also asked on https://discuss.pytorch.org/t/attentiondecoderrnn-without-max-length/13473 – alvas Feb 09 '18 at 04:06
  • Did you consider local attention instead of global? – Maxim Feb 14 '18 at 10:35
  • Not yet, but if it's global, there must be some sort of a max, right? It's just for tractability, no? Theoretically, it can do attention from all source words to all target words. It's just that if max_length = no. of target words, then for a given sentence pair any positions that don't exist in the source will have zeros. – alvas Feb 15 '18 at 02:21
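
For reference, the padding described in the last comment works roughly like this in the tutorial's training loop (names and sizes below are illustrative):

    import torch

    MAX_LENGTH, hidden_size = 10, 256   # illustrative values
    input_length = 7                    # actual source-sentence length for this pair

    encoder_states = torch.randn(input_length, hidden_size)   # stand-in for real encoder outputs

    encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
    encoder_outputs[:input_length] = encoder_states
    # positions input_length .. MAX_LENGTH-1 stay zero, i.e. source positions that
    # don't exist contribute zero vectors to the attention-weighted sum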

1 Answer


Attention modulates the input to the decoder. That is, attention modulates the encoded sequence, which has the same length as the input sequence. Thus, MAX_LENGTH should be the maximum sequence length over all your input sequences.
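
Note that the attention in Bahdanau et al. (2015) scores each actual encoder output against the current decoder state, so its parameter shapes do not depend on a maximum length; only the tutorial's simplified variant does. A minimal sketch of such an additive attention module (illustrative, not the tutorial's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdditiveAttention(nn.Module):
        """Bahdanau-style additive attention; works for any source length."""
        def __init__(self, hidden_size):
            super(AdditiveAttention, self).__init__()
            self.W_dec = nn.Linear(hidden_size, hidden_size, bias=False)
            self.W_enc = nn.Linear(hidden_size, hidden_size, bias=False)
            self.v = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, decoder_hidden, encoder_outputs):
            # decoder_hidden: [1, hidden_size], encoder_outputs: [src_len, hidden_size]
            scores = self.v(torch.tanh(self.W_dec(decoder_hidden) + self.W_enc(encoder_outputs)))
            weights = F.softmax(scores.squeeze(1), dim=0)        # [src_len], no MAX_LENGTH anywhere
            context = weights.unsqueeze(0) @ encoder_outputs     # [1, hidden_size]
            return context, weights

    # usage: the same module handles source sentences of any length
    attn = AdditiveAttention(256)
    context, weights = attn(torch.randn(1, 256), torch.randn(7, 256))

Here the learned parameters depend only on hidden_size, so no value has to be picked for the attn layer's output dimension, and certainly not output_size.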

patapouf_ai