
The attention mechanism used with LSTM encoder-decoders is a straightforward softmax feed-forward network that takes in the hidden states from each time step of the encoder along with the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined, yet 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as training a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in Advance

Gulzar
Andrew Tu
  • Here's a nice visualization of attention that I came across: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39 – David Parks May 21 '19 at 16:02

2 Answers


I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T:

c_i = α_i1 h_1 + ... + α_iT h_T

The number of time steps T may differ from sample to sample because the coefficients α_ij do not form a vector of fixed size. In fact, they are computed by softmax(e_i1, ..., e_iT), where each e_ij is the output of a neural network whose inputs are the encoder hidden state h_j and the decoder hidden state s_(i-1):

e_ij = f(s_(i-1), h_j)

Thus, before y_i is computed, this neural network must be evaluated T times, producing T weights α_i1, ..., α_iT. Also, this tensorflow implementation might be useful.
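To make the variable-T point concrete, here is a minimal numpy sketch of one decoder step of this additive (Bahdanau-style) attention. The scoring function f is assumed to be v^T tanh(Wa s + Ua h), as in the paper; the weight matrices Wa, Ua, va and their sizes are illustrative placeholders, not taken from the question.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, Wa, Ua, va):
    # H: (T, hidden) encoder hidden states; T may vary per sample.
    # e_ij = va^T tanh(Wa s_(i-1) + Ua h_j)  -- one scalar score per time step
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])
    alphas = softmax(scores)                  # (T,) weights, sum to 1
    context = (alphas[:, None] * H).sum(axis=0)  # context vector, (hidden,)
    return context, alphas

# The same fixed-size network f handles any T, because f is applied
# once per time step and softmax works on a vector of any length:
rng = np.random.default_rng(0)
hidden, attn = 4, 3
Wa = rng.standard_normal((attn, hidden))
Ua = rng.standard_normal((attn, hidden))
va = rng.standard_normal(attn)
s_prev = rng.standard_normal(hidden)
for T in (2, 5, 9):
    c, a = attention_step(s_prev, rng.standard_normal((T, hidden)), Wa, Ua, va)
    assert c.shape == (hidden,) and a.shape == (T,)
    assert np.isclose(a.sum(), 1.0)
```

Note that the network f itself has a fixed input size (one h_j plus s_(i-1)); only the number of times it is evaluated varies with T, which resolves the apparent contradiction in the question.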

Artur Lacerda
    Congratulations on your first answer, which demonstrates research and is very well formatted! – trincot Jul 22 '17 at 22:44
    I'm still a little confused, given that T is a variable number of inputs. After looking through the paper and the implementation you provided (thanks for that, great answer too by the way!), it seems like the solution is to simply fix an upper limit on the number of time steps T. In order to compute the alpha values, which requires a standard neural network layer transformation, we need to decide on a fixed number of alpha values to output from that transformation. I'd love to get a solid confirmation about this point though. It's been really hard to extrapolate from this paper and others. – David Parks Feb 01 '18 at 17:43
  • The output of the neural network f is a single coefficient e_ij. This NN is evaluated T times, and T can be arbitrary. The alpha values are the softmax of these T numbers. The softmax operation takes N numbers and produces N numbers, and N doesn't have to be fixed. Therefore, there's no need for an upper bound on T. I hope I'm getting things right, because I've recently used a Keras attention layer (https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2) which required a fixed T, so I had to pad the dataset. – Artur Lacerda Feb 02 '18 at 18:14
    @DavidParks [Here](https://datascience.stackexchange.com/q/27217/67328) I've written a slightly different explanation, hope it complements this answer. – Esmailian Apr 25 '19 at 21:13
import tensorflow as tf
from tensorflow.contrib import layers

L2_REG = 1e-4  # assumed value; the original code defines L2_REG elsewhere

def attention(inputs, size, scope):
    # inputs: (batch, time, size) tensor of RNN outputs
    with tf.variable_scope(scope or 'attention'):
        # learned context vector used to score each time step
        attention_context_vector = tf.get_variable(name='attention_context_vector',
                                                   shape=[size],
                                                   regularizer=layers.l2_regularizer(scale=L2_REG),
                                                   dtype=tf.float32)
        input_projection = layers.fully_connected(inputs, size,
                                                  activation_fn=tf.tanh,
                                                  weights_regularizer=layers.l2_regularizer(scale=L2_REG))
        # similarity of each projected time step to the context vector
        vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector),
                                    axis=2, keep_dims=True)
        attention_weights = tf.nn.softmax(vector_attn, dim=1)  # softmax over time
        weighted_projection = tf.multiply(inputs, attention_weights)
        outputs = tf.reduce_sum(weighted_projection, axis=1)  # (batch, size)
        return outputs

Hope this piece of code helps you understand how attention works. I use this function in my document classification jobs; it is an LSTM-attention model, which is different from your encoder-decoder model.
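For readers without TensorFlow 1.x, here is a framework-free numpy sketch of the same idea: score every time step against a single learned context vector, softmax the scores over time, and return the weighted sum. The parameter names (W, b, u) and sizes are illustrative, not from the code above.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(inputs, W, b, u):
    # inputs: (batch, T, size) LSTM outputs
    proj = np.tanh(inputs @ W + b)      # projected states, (batch, T, size)
    scores = proj @ u                   # similarity to context vector u, (batch, T)
    weights = softmax(scores, axis=1)   # attention weights over time
    # weighted sum over time collapses variable T to a fixed-size vector
    return (inputs * weights[..., None]).sum(axis=1)  # (batch, size)

rng = np.random.default_rng(1)
batch, T, size = 2, 7, 5
x = rng.standard_normal((batch, T, size))
W = rng.standard_normal((size, size))
b = np.zeros(size)
u = rng.standard_normal(size)
out = context_attention(x, W, b, u)
assert out.shape == (batch, size)
```

Because the weighted sum collapses the time axis, the classifier on top always sees a fixed-size vector regardless of sequence length.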