
I am learning about attention models and their implementations in Keras. While searching I came across the following two methods (first and second) for creating an attention layer in Keras:

# First method

import tensorflow as tf


class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # add a time axis to the decoder hidden state so it broadcasts over the features
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # additive score for every time step
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        # normalise the scores over the time axis
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        # context vector is the attention-weighted sum of the features
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
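
For reference, here is a minimal call sketch using the class above; the shapes are made up purely for illustration (the encoder returns one feature vector per time step and the decoder hidden state has the same width):

batch_size, time_steps, units = 16, 10, 32
features = tf.random.normal((batch_size, time_steps, units))   # e.g. encoder outputs with return_sequences=True
hidden = tf.random.normal((batch_size, units))                 # e.g. the current decoder hidden state

attention_layer = Attention(units)
context_vector, attention_weights = attention_layer(features, hidden)
# context_vector: (batch_size, units), attention_weights: (batch_size, time_steps, 1)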

# Second method

activations = LSTM(units, return_sequences=True)(embedded)

# compute an importance score for each time step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# `merge` is the old Keras 1.x helper; in Keras 2 this would be Multiply()([activations, attention])
sent_representation = merge([activations, attention], mode='mul')

The math behind the attention model (additive / Bahdanau-style attention) is:

score_t = v^T tanh(W_1 h_t + W_2 s)
alpha_t = softmax(score_t)
context = sum_t alpha_t * h_t

where h_t are the encoder features at time step t and s is the decoder hidden state.

The first method is more or less a direct implementation of the attention math, whereas the second method, which turns up far more often on the internet, is not.

My real doubt is about these lines in the second method:

attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
  • Which is the right implementation for attention?
  • What is the intuition behind the RepeatVector and Permute layers in the second method?
  • In the first method W1 and W2 are weights; why is a Dense layer considered as weights here?
  • Why is V a single-unit Dense layer?
  • What does V(score) do?
Eka
  • here's a simple implementation of attention: https://stackoverflow.com/questions/62948332/how-to-add-attention-layer-to-a-bi-lstm/62949137#62949137 – Marco Cerliani Jul 17 '20 at 14:57

1 Answer


Which is the right implementation for attention?

I'd recommend the following:

https://github.com/tensorflow/models/blob/master/official/transformer/model/attention_layer.py#L24

The multi-head Attention layer above implements a nifty trick: it reshapes the input so that, instead of being shaped as (batch_size, time_steps, features), it is shaped as (batch_size, heads, time_steps, features / heads), and then it performs the attention computation on each "features / heads" block.
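
As a rough illustration of that reshaping step (the helper name and shapes here are my own sketch, not copied from the linked file, and the feature dimension is assumed to be known statically):

import tensorflow as tf

def split_heads(x, num_heads):
    # (batch_size, time_steps, features) -> (batch_size, num_heads, time_steps, features / num_heads)
    batch_size = tf.shape(x)[0]
    time_steps = tf.shape(x)[1]
    depth = x.shape[-1] // num_heads           # features handled by each head
    x = tf.reshape(x, [batch_size, time_steps, num_heads, depth])
    return tf.transpose(x, perm=[0, 2, 1, 3])  # move the heads axis next to the batch axis

# each head then runs attention on its own (time_steps, depth) slice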

What is the intuition behind the RepeatVector and Permute layers in the second method?

Your code is incomplete... there is a matrix multiplication missing (you don't show the attention output actually being used). That step probably changes the shape of the result, and this code is trying to somehow recover the right shape. It is probably not the best approach.
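
As a shape-level sketch, reusing the variable names from your second method and the Keras 2 API (this is my reading of the intent, not the original author's code):

from tensorflow.keras.layers import Multiply, Lambda
import tensorflow.keras.backend as K

# attention after the softmax:          (batch, time_steps)
# RepeatVector(units) copies it to:     (batch, units, time_steps)
# Permute([2, 1]) swaps the axes to:    (batch, time_steps, units)
# i.e. each time step's scalar weight is duplicated across all `units`
# features so it lines up with `activations` for the element-wise product.

weighted = Multiply()([activations, attention])                     # (batch, time_steps, units)
sent_representation = Lambda(lambda t: K.sum(t, axis=1))(weighted)  # (batch, units)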

In the first method W1 and W2 are weights; why is a Dense layer considered as weights here?

A Dense layer is a set of weights: a learned matrix (plus an optional bias) that multiplies the last axis of its input. Your question is a bit vague.
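
A quick sketch of that, with made-up shapes, showing that calling a Dense layer is just a matrix multiplication with its learned kernel:

import tensorflow as tf

dense = tf.keras.layers.Dense(4, use_bias=False)
x = tf.random.normal((2, 3, 8))     # (batch, time_steps, features)
y = dense(x)                        # y = x @ W, applied to the last axis

print(dense.kernel.shape)           # (8, 4) -- this matrix plays the role of W1/W2 in the first method
print(y.shape)                      # (2, 3, 4)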

Why is V a single-unit Dense layer?

That is a very odd choice that doesn't match my reading of the paper nor the implementations that I've seen.
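
Purely in terms of shapes, though, here is what that V(score) call computes (the shapes below are illustrative, not from the original code):

import tensorflow as tf

V = tf.keras.layers.Dense(1)
score = tf.random.normal((2, 10, 32))   # (batch, time_steps, units): output of the tanh
print(V(score).shape)                   # (2, 10, 1): one scalar score per time step,
                                        # which the softmax over axis=1 turns into attention weights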

Pedro Marques
  • Hey, thank you for the answer. `Your code is incomplete... there is a matrix multiplication missing ....` I took the code from here: https://stackoverflow.com/q/42918446/996366 . Can you please tell me what's the difference between an ordinary attention and a multi-headed attention? – Eka Jul 12 '19 at 05:07
  • Multi-headed attention means that there are multiple "queries" performed from a time step into the global (all time steps) state. The "queries" are also learnt. Each of the heads will tend to focus on different characteristics of the sentence. Of course this is all pure guessing... but you can imagine the heads focusing on multiple characteristics of the sentence (e.g. subject, verb, target) in order to understand how to translate from one language to another, for example. – Pedro Marques Jul 12 '19 at 12:22