
The attention mechanism used with LSTM encoder-decoders is a straightforward softmax feed-forward network that takes in the hidden states from each time step of the encoder along with the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined, yet 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as training a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in Advance

Gulzar
Andrew Tu
  • Here's a nice visualization of attention that I came across: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39 – David Parks May 21 '19 at 16:02

2 Answers


I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T:

c_i = α_i1 h_1 + ... + α_iT h_T

The number of time steps T may differ from sample to sample because the coefficients α_ij do not form a vector of fixed size. In fact, they are computed by softmax(e_i1, ..., e_iT), where each e_ij is the output of a neural network whose inputs are the encoder hidden state h_j and the decoder hidden state s_(i-1):

e_ij = f(s_(i-1), h_j)

Thus, before y_i is computed, this neural network must be evaluated T times, producing T weights α_i1, ..., α_iT. Also, this tensorflow implementation might be useful.
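To make the variable-T point concrete, here is a minimal numpy sketch of one decoder step of this additive (Bahdanau-style) attention. The scoring function f is assumed to be v^T tanh(Wa s + Ua h), as in the paper; the weight matrices Wa, Ua, va and their sizes are illustrative placeholders, not taken from the question.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, Wa, Ua, va):
    # H: (T, hidden) encoder hidden states; T may vary per sample.
    # e_ij = va^T tanh(Wa s_(i-1) + Ua h_j)  -- one scalar score per time step
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])
    alphas = softmax(scores)                  # (T,) weights, sum to 1
    context = (alphas[:, None] * H).sum(axis=0)  # context vector, (hidden,)
    return context, alphas

# The same fixed-size network f handles any T, because f is applied
# once per time step and softmax works on a vector of any length:
rng = np.random.default_rng(0)
hidden, attn = 4, 3
Wa = rng.standard_normal((attn, hidden))
Ua = rng.standard_normal((attn, hidden))
va = rng.standard_normal(attn)
s_prev = rng.standard_normal(hidden)
for T in (2, 5, 9):
    c, a = attention_step(s_prev, rng.standard_normal((T, hidden)), Wa, Ua, va)
    assert c.shape == (hidden,) and a.shape == (T,)
    assert np.isclose(a.sum(), 1.0)
```

Note that the network f itself has a fixed input size (one h_j plus s_(i-1)); only the number of times it is evaluated varies with T, which resolves the apparent contradiction in the question.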

Artur Lacerda
    Congratulations on your first answer, which demonstrates research and is very well formatted! – trincot Jul 22 '17 at 22:44
    I'm still a little confused, given that T is a variable number of inputs. After looking through the paper and the implementation you provided (thanks for that, great answer too by the way!), it seems like the solution is to simply fix an upper limit on the number of time steps T. In order to compute the alpha values, which requires a standard neural network layer transformation, we need to decide on a fixed number of alpha values to output from that transformation. I'd love to get a solid confirmation about this point though. It's been really hard to extrapolate from this paper and others. – David Parks Feb 01 '18 at 17:43
  • The output of the neural network f is a single coefficient e_ij. This NN is evaluated T times, and T can be arbitrary. The alpha values are the softmax of these T numbers. The softmax operation takes N numbers and produces N numbers, and N doesn't have to be fixed. Therefore, there's no need for an upper bound on T. I hope I'm getting things right, because I've recently used a Keras attention layer (https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2) which required a fixed T, so I had to pad the dataset. – Artur Lacerda Feb 02 '18 at 18:14
    @DavidParks [Here](https://datascience.stackexchange.com/q/27217/67328) I've written a slightly different explanation, hope it complements this answer. – Esmailian Apr 25 '19 at 21:13
import tensorflow as tf
from tensorflow.contrib import layers

L2_REG = 1e-4  # assumed value; the original code defines L2_REG elsewhere

def attention(inputs, size, scope):
    # inputs: (batch, time, size) tensor of RNN outputs
    with tf.variable_scope(scope or 'attention'):
        # learned context vector used to score each time step
        attention_context_vector = tf.get_variable(name='attention_context_vector',
                                                   shape=[size],
                                                   regularizer=layers.l2_regularizer(scale=L2_REG),
                                                   dtype=tf.float32)
        input_projection = layers.fully_connected(inputs, size,
                                                  activation_fn=tf.tanh,
                                                  weights_regularizer=layers.l2_regularizer(scale=L2_REG))
        # similarity of each projected time step to the context vector
        vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector),
                                    axis=2, keep_dims=True)
        attention_weights = tf.nn.softmax(vector_attn, dim=1)  # softmax over time
        weighted_projection = tf.multiply(inputs, attention_weights)
        outputs = tf.reduce_sum(weighted_projection, axis=1)  # (batch, size)
        return outputs

Hope this piece of code helps you understand how attention works. I use this function in my document classification jobs; it is an LSTM-attention model, which is different from your encoder-decoder model.
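For readers without TensorFlow 1.x, here is a framework-free numpy sketch of the same idea: score every time step against a single learned context vector, softmax the scores over time, and return the weighted sum. The parameter names (W, b, u) and sizes are illustrative, not from the code above.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(inputs, W, b, u):
    # inputs: (batch, T, size) LSTM outputs
    proj = np.tanh(inputs @ W + b)      # projected states, (batch, T, size)
    scores = proj @ u                   # similarity to context vector u, (batch, T)
    weights = softmax(scores, axis=1)   # attention weights over time
    # weighted sum over time collapses variable T to a fixed-size vector
    return (inputs * weights[..., None]).sum(axis=1)  # (batch, size)

rng = np.random.default_rng(1)
batch, T, size = 2, 7, 5
x = rng.standard_normal((batch, T, size))
W = rng.standard_normal((size, size))
b = np.zeros(size)
u = rng.standard_normal(size)
out = context_attention(x, W, b, u)
assert out.shape == (batch, size)
```

Because the weighted sum collapses the time axis, the classifier on top always sees a fixed-size vector regardless of sequence length.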