I am learning about attention models and their implementations in Keras. While searching, I came across the two methods below (first and second) for creating an attention layer in Keras.
# First method (Bahdanau-style additive attention, subclassing tf.keras.Model)
import tensorflow as tf

class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: encoder outputs, shape (batch, time_steps, feat_dim)
        # hidden:   decoder state,   shape (batch, hidden_dim)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)                        # (batch, 1, hidden_dim)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))   # (batch, time_steps, units)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)                 # (batch, time_steps, 1)
        context_vector = attention_weights * features                            # (batch, time_steps, feat_dim)
        context_vector = tf.reduce_sum(context_vector, axis=1)                   # (batch, feat_dim)
        return context_vector, attention_weights
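For reference, here is a minimal usage sketch of the first method (the tensor sizes are made-up example values, and it assumes the Attention class above has been defined):

import tensorflow as tf

batch, time_steps, feat_dim, hidden_dim, units = 4, 7, 16, 12, 8
features = tf.random.normal((batch, time_steps, feat_dim))  # e.g. encoder outputs
hidden = tf.random.normal((batch, hidden_dim))              # e.g. decoder hidden state

attn = Attention(units)
context_vector, attention_weights = attn(features, hidden)
print(context_vector.shape)     # (4, 16)   - weighted sum of the features over time
print(attention_weights.shape)  # (4, 7, 1) - one weight per time step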
# Second method (Keras functional API; `embedded` is the output of an Embedding
# layer and `units` is the LSTM size)
activations = LSTM(units, return_sequences=True)(embedded)    # (batch, time_steps, units)

# compute an importance score for each time step
attention = Dense(1, activation='tanh')(activations)          # (batch, time_steps, 1)
attention = Flatten()(attention)                              # (batch, time_steps)
attention = Activation('softmax')(attention)                  # (batch, time_steps)
attention = RepeatVector(units)(attention)                    # (batch, units, time_steps)
attention = Permute([2, 1])(attention)                        # (batch, time_steps, units)
# merge(..., mode='mul') is the old Keras 1 API; in Keras 2+ this would be
# Multiply()([activations, attention])
sent_representation = merge([activations, attention], mode='mul')  # (batch, time_steps, units)
The math behind the attention model (the Bahdanau-style additive form that the first method follows) is:

score(h_t, h_s) = v_a^T * tanh(W_1 * h_s + W_2 * h_t)
alpha_ts = softmax(score(h_t, h_s))
context_t = sum_s(alpha_ts * h_s)
Looking at the first method, it is a fairly direct implementation of this attention math, whereas the second method, which gets far more hits on the internet, is not.
My real doubt is about these lines in the second method:
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
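Tracing the shapes through those lines with a small numpy-only sketch (illustrative shapes only), this is what RepeatVector and Permute appear to do; the combination looks equivalent to broadcasting the per-step softmax weights over the feature axis:

import numpy as np

batch, time_steps, units = 2, 5, 3
activations = np.random.rand(batch, time_steps, units)   # stands in for the LSTM outputs
weights = np.random.rand(batch, time_steps)              # one softmax weight per time step

# RepeatVector(units): (batch, time_steps) -> (batch, units, time_steps)
repeated = np.repeat(weights[:, None, :], units, axis=1)
# Permute([2, 1]): (batch, units, time_steps) -> (batch, time_steps, units)
permuted = np.transpose(repeated, (0, 2, 1))

weighted = activations * permuted                        # (batch, time_steps, units)
# same result as plain broadcasting of the weights:
assert np.allclose(weighted, activations * weights[:, :, None])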
My questions are:

- Which is the right implementation of attention?
- What is the intuition behind the RepeatVector and Permute layers in the second method?
- In the first method, W1 and W2 are weights; why is a Dense layer considered a weight here?
- Why is V a single-unit Dense layer?
- What does V(score) do?