Questions regarding the attention mechanism in deep learning
Questions tagged [attention-model]
389 questions
36
votes
5 answers
What is the difference between Luong attention and Bahdanau attention?
These two attention mechanisms are used in seq2seq models. They are introduced as multiplicative and additive attention in this TensorFlow documentation. What is the difference?
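A minimal NumPy sketch may help frame the answers: only the scoring function differs between the two, and the matrices W, W1, W2 and vector v below stand in for hypothetical learned parameters. Luong's multiplicative score is a (weighted) dot product, while Bahdanau's additive score passes the states through a small feed-forward layer; both feed the same softmax.

import numpy as np

# Toy dimensions: decoder state s has size d, encoder states H are (T, d).
d, T = 4, 5
rng = np.random.default_rng(0)
s, H = rng.normal(size=d), rng.normal(size=(T, d))

# Luong (multiplicative) "general" score: s^T W h_t for each encoder state h_t.
W = rng.normal(size=(d, d))
luong_scores = H @ W @ s                                   # shape (T,)

# Bahdanau (additive) score: v^T tanh(W1 s + W2 h_t).
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
bahdanau_scores = np.tanh(s @ W1.T + H @ W2.T) @ v         # shape (T,)

# Both variants turn their scores into attention weights with the same softmax.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(luong_scores))
print(softmax(bahdanau_scores))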

Shamane Siriwardhana
- 3,951
- 6
- 33
- 73
29
votes
3 answers
How to understand masked multi-head attention in the Transformer
I'm currently studying the code of the Transformer, but I cannot understand the masked multi-head attention in the decoder. The paper says that it is there to prevent you from seeing the word being generated, but I cannot understand: if the words after the generated word have not…
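A minimal PyTorch sketch of the decoder's causal (look-ahead) mask, with toy sizes: scores for positions after the current one are set to -inf before the softmax, so each position can only attend to itself and earlier positions even though the whole target sequence is fed in at once during training.

import torch
import torch.nn.functional as F

T, d = 5, 8                                    # sequence length, model dim (toy sizes)
q = k = v = torch.randn(T, d)                  # decoder self-attention: q, k, v from the same sequence

scores = q @ k.t() / d ** 0.5                  # (T, T) raw attention scores
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))    # hide future positions

weights = F.softmax(scores, dim=-1)            # row t only attends to positions <= t
out = weights @ v
print(weights)                                 # the upper triangle is exactly zero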

Neptuner
- 291
- 1
- 3
- 3
20
votes
2 answers
What is the difference between attn_mask and key_padding_mask in MultiheadAttention?
What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention?
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. When given a binary mask and a value is True, the…
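A small sketch of how the two masks are passed to torch.nn.MultiheadAttention (shapes and mask values below are made up for illustration): key_padding_mask is per sample and hides whole padded key positions from every query, while attn_mask is per query-key pair and shared by all samples, e.g. a causal mask.

import torch
import torch.nn as nn

N, L, S, E = 2, 4, 4, 8                        # batch, target length, source length, embed dim
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2)

# Default layout is (seq_len, batch, embed_dim).
query = torch.randn(L, N, E)
key = value = torch.randn(S, N, E)

# key_padding_mask: shape (N, S); True marks padded keys that every query should ignore.
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True,  True]])

# attn_mask: shape (L, S); the same pattern for every sample, here a causal mask.
attn_mask = torch.triu(torch.ones(L, S, dtype=torch.bool), diagonal=1)

out, weights = mha(query, key, value,
                   key_padding_mask=key_padding_mask,
                   attn_mask=attn_mask)
print(weights.shape)                           # (N, L, S); masked positions get ~zero weight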

one
- 2,205
- 1
- 15
- 37
17
votes
1 answer
Adding Attention on top of simple LSTM layer in Tensorflow 2.0
I have a simple network of one LSTM layer and two Dense layers, as follows:
model = tf.keras.Sequential()
model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(layers.Dense(20, activation='sigmoid'))
model.add(layers.Dense(1,…
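One possible way to extend such a model, sketched with the built-in tf.keras.layers.Attention and assumed input dimensions (substitute train_X.shape[1] and train_X.shape[2]): the LSTM must return its full sequence, and here its last hidden state is used as the query.

import tensorflow as tf
from tensorflow.keras import layers

timesteps, features = 10, 3                    # assumed; use train_X.shape[1], train_X.shape[2]
inputs = tf.keras.Input(shape=(timesteps, features))

seq = layers.LSTM(20, return_sequences=True)(inputs)       # keep all timesteps for attention
query = layers.Lambda(lambda t: t[:, -1:, :])(seq)         # last hidden state as query, (batch, 1, 20)

context = layers.Attention()([query, seq])                 # dot-product attention over the sequence
context = layers.Flatten()(context)

x = layers.Dense(20, activation='sigmoid')(context)
outputs = layers.Dense(1)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')
model.summary()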

greco.roamin
- 799
- 1
- 6
- 20
17
votes
2 answers
Does attention make sense for Autoencoders?
I am struggling with the concept of attention in the context of autoencoders. I believe I understand the usage of attention with regard to seq2seq translation - after training the combined encoder and decoder, we can use both encoder and…
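As a concrete reference point for the discussion, here is one way attention could be wired into a sequence autoencoder (a sketch with assumed shapes, not a recommendation): the decoder attends back over the encoder outputs instead of relying only on the bottleneck state.

import tensorflow as tf
from tensorflow.keras import layers

timesteps, n_features = 12, 4                  # assumed input shape
inputs = tf.keras.Input(shape=(timesteps, n_features))

enc_seq, state_h, state_c = layers.LSTM(32, return_sequences=True, return_state=True)(inputs)
dec_seq = layers.LSTM(32, return_sequences=True)(enc_seq, initial_state=[state_h, state_c])

# Each decoder step attends back over the encoder outputs rather than relying on the
# single bottleneck state alone.
context = layers.AdditiveAttention()([dec_seq, enc_seq])
recon = layers.TimeDistributed(layers.Dense(n_features))(layers.Concatenate()([dec_seq, context]))

autoencoder = tf.keras.Model(inputs, recon)
autoencoder.compile(optimizer='adam', loss='mse')

Because the decoder can then look at every encoder step, the bottleneck is largely bypassed, which is exactly the trade-off the question is about.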

user3641187
- 405
- 5
- 10
16
votes
3 answers
How to build an attention model with Keras?
I am trying to understand attention models and also build one myself. After many searches I came across this website, which has an attention model coded in Keras and also looks simple. But when I tried to build that same model on my machine, it's giving…
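For orientation, a minimal attention-pooling pattern in Keras (a sketch, not the model from the linked website): score each timestep with a shared Dense layer, softmax over time, and take the weighted sum of the LSTM outputs. Shapes are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

timesteps, n_features = 20, 8                  # assumed input shape
inputs = tf.keras.Input(shape=(timesteps, n_features))
h = layers.LSTM(32, return_sequences=True)(inputs)         # (batch, timesteps, 32)

scores = layers.Dense(1, activation='tanh')(h)             # one score per timestep
weights = layers.Softmax(axis=1)(scores)                   # normalize over the time axis
context = layers.Dot(axes=1)([weights, h])                 # weighted sum -> (batch, 1, 32)
context = layers.Flatten()(context)

outputs = layers.Dense(1, activation='sigmoid')(context)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')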

Eka
- 14,170
- 38
- 128
- 212
16
votes
5 answers
RuntimeError: "exp" not implemented for 'torch.LongTensor'
I am following this tutorial: http://nlp.seas.harvard.edu/2018/04/03/attention.html
to implement the Transformer model from the "Attention Is All You Need" paper.
However, I am getting the following error:
RuntimeError: "exp" not implemented for…

noob
- 5,954
- 6
- 20
- 32
16
votes
2 answers
Attention Layer throwing TypeError: Permute layer does not support masking in Keras
I have been following this post in order to implement an attention layer over my LSTM model.
Code for the attention layer:
INPUT_DIM = 2
TIME_STEPS = 20
SINGLE_ATTENTION_VECTOR = False
APPLY_ATTENTION_BEFORE_LSTM = False
def…
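The TypeError usually means a mask (from Embedding(mask_zero=True) or a Masking layer) is reaching Permute, which cannot consume it. One workaround, sketched under that assumption, is a thin layer that declares mask support and deliberately drops the mask before the Permute/Dense/Permute/Multiply attention block; padding then has to be handled explicitly.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical wrapper: declares mask support and deliberately stops the mask here.
class DropMask(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True

    def compute_mask(self, inputs, mask=None):
        return None                            # do not propagate the mask any further

    def call(self, inputs, mask=None):
        return inputs

TIME_STEPS, INPUT_DIM = 20, 2
x = tf.keras.Input(shape=(TIME_STEPS, INPUT_DIM))
h = layers.Masking()(x)                        # stands in for Embedding(mask_zero=True) etc.
h = layers.LSTM(32, return_sequences=True)(h)
h = DropMask()(h)                              # without this line, Permute raises the TypeError
a = layers.Permute((2, 1))(h)
a = layers.Dense(TIME_STEPS, activation='softmax')(a)
a = layers.Permute((2, 1))(a)
attended = layers.Multiply()([h, a])
model = tf.keras.Model(x, layers.Flatten()(attended))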

Saurav--
- 1,530
- 2
- 15
- 33
14
votes
1 answer
How to visualize attention in an LSTM using the keras-self-attention package?
I'm using keras-self-attention to implement an attention LSTM in Keras. How can I visualize the attention part after training the model? This is a time-series forecasting case.
from keras.models import Sequential
from keras_self_attention import…
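A sketch of one approach, assuming the package's SeqSelfAttention accepts return_attention=True and then returns both its outputs and the attention weights (as its README describes), and that it is compatible with the Keras flavour you use: build a second model that exposes the weights and plot them as a heatmap. Shapes and data below are placeholders.

import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from keras_self_attention import SeqSelfAttention

timesteps, n_features = 30, 5                  # assumed input shape
inputs = keras.Input(shape=(timesteps, n_features))
h = keras.layers.LSTM(32, return_sequences=True)(inputs)
h, attn = SeqSelfAttention(return_attention=True,
                           attention_activation='sigmoid')(h)
out = keras.layers.Dense(1)(keras.layers.Flatten()(h))

model = keras.Model(inputs, out)               # train this as usual
viewer = keras.Model(inputs, attn)             # second model that exposes the weights

sample = np.random.rand(1, timesteps, n_features)
plt.imshow(viewer.predict(sample)[0], cmap='viridis')      # (timesteps, timesteps) matrix
plt.xlabel('attended timestep')
plt.ylabel('query timestep')
plt.show()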

Eghbal
- 3,892
- 13
- 51
- 112
14
votes
2 answers
Why is the embedding vector multiplied by a constant in the Transformer model?
I am learning to apply the Transformer model proposed in Attention Is All You Need, following TensorFlow's official tutorial Transformer model for language understanding.
As the section Positional encoding says:
Since this model doesn't contain any recurrence or…
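A small sketch of what the tutorial does: the embedding output is multiplied by sqrt(d_model) before the positional encoding is added. One commonly given reason is that freshly initialized embeddings are much smaller in magnitude than the sin/cos positional values in [-1, 1], so without the scaling the positional signal would dominate the sum. Sizes below are illustrative.

import tensorflow as tf

d_model, vocab_size, seq_len = 512, 8000, 10
embedding = tf.keras.layers.Embedding(vocab_size, d_model)

tokens = tf.random.uniform((1, seq_len), maxval=vocab_size, dtype=tf.int32)
x = embedding(tokens)                                      # freshly initialized values are tiny (~±0.05)
x *= tf.math.sqrt(tf.cast(d_model, tf.float32))            # the constant: sqrt(512) ≈ 22.6

pos_encoding = tf.ones((1, seq_len, d_model))              # stand-in for the sin/cos table in [-1, 1]
out = x + pos_encoding
print(out.shape)                                           # (1, 10, 512)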

giser_yugang
- 6,058
- 4
- 21
- 44
13
votes
2 answers
Keras - Add attention mechanism to an LSTM model
With the following code:
model = Sequential()
num_features = data.shape[2]
num_samples = data.shape[1]
model.add(
LSTM(16, batch_input_shape=(None, num_samples, num_features), return_sequences=True,…
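One possible rewrite of such a model with attention inserted, sketched with the built-in Bahdanau-style tf.keras.layers.AdditiveAttention and assumed values for num_samples and num_features:

import tensorflow as tf
from tensorflow.keras import layers

num_samples, num_features = 30, 6              # assumed; the question takes these from data.shape
inputs = tf.keras.Input(shape=(num_samples, num_features))

seq, state_h, _ = layers.LSTM(16, return_sequences=True, return_state=True)(inputs)
query = layers.Reshape((1, 16))(state_h)                   # final hidden state as the query

context = layers.AdditiveAttention()([query, seq])         # Bahdanau-style scoring
context = layers.Flatten()(context)
outputs = layers.Dense(1)(context)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')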

Shlomi Schwartz
- 8,693
- 29
- 109
- 186
12
votes
2 answers
Why must embed dimension be divisible by the number of heads in MultiheadAttention?
I am learning about the Transformer. Here is the PyTorch documentation for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
Why require…
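A short sketch of why the assert is there: the projected embedding is reshaped so that each head owns an equal head_dim slice of the embed_dim channels, and the heads are concatenated back afterwards; an embed_dim that is not divisible by num_heads would leave channels belonging to no head.

import torch

embed_dim, num_heads = 12, 3
head_dim = embed_dim // num_heads              # 4 channels per head
assert head_dim * num_heads == embed_dim       # the constraint from the PyTorch source

x = torch.randn(2, 5, embed_dim)               # (batch, seq_len, embed_dim), already projected
heads = x.view(2, 5, num_heads, head_dim).transpose(1, 2)  # (batch, num_heads, seq_len, head_dim)
print(heads.shape)

# After attention, the per-head results are concatenated back to embed_dim.
merged = heads.transpose(1, 2).reshape(2, 5, embed_dim)
print(merged.shape)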

jason
- 1,998
- 3
- 22
- 42
12
votes
2 answers
Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?
To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented in many places, for example: here. I am not referring to the seq2seq type of attention used in…
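A small NumPy sketch of the two options the question compares, with made-up scores: renormalizing the weights of the valid steps is exactly equivalent to a masked softmax that assigns -inf scores to the padded steps.

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])   # made-up scores; last two steps are padding
mask = np.array([1, 1, 1, 0, 0], dtype=bool)

# Unmasked softmax: the padded steps still receive probability mass.
naive = np.exp(scores) / np.exp(scores).sum()

# Masked softmax: give padded steps a -inf score before normalizing...
masked_scores = np.where(mask, scores, -np.inf)
masked = np.exp(masked_scores - masked_scores.max())
masked /= masked.sum()

# ...which matches zeroing the naive weights and re-normalizing them.
renorm = naive * mask
renorm /= renorm.sum()

print(naive.round(3))
print(masked.round(3))
print(renorm.round(3))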

t-flow
- 123
- 8
12
votes
1 answer
Visualizing attention activation in Tensorflow
Is there a way to visualize the attention weights on some input, like the figure in the link above (from Bahdanau et al., 2014), in TensorFlow's seq2seq models? I have found TensorFlow's GitHub issue regarding this, but I couldn't find out how to…
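Independently of how the weights are extracted from TensorFlow's seq2seq model, the figure itself is just a labelled heatmap of the (target_len, source_len) alignment matrix; a sketch with placeholder tokens and weights:

import numpy as np
import matplotlib.pyplot as plt

# Assumes you can already get the alignment matrix for one example from the model;
# tokens and weights below are placeholders.
source = ['the', 'cat', 'sat', '</s>']
target = ['le', 'chat', 'était', 'assis', '</s>']
attn = np.random.dirichlet(np.ones(len(source)), size=len(target))

fig, ax = plt.subplots()
ax.imshow(attn, cmap='gray')
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel('source token')
ax.set_ylabel('target token')
plt.show()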

reiste
- 123
- 1
- 5
11
votes
2 answers
How can LSTM attention have variable-length input?
The attention mechanism for LSTMs is a simple softmax feed-forward network that takes in the encoder's hidden state at each time step together with the decoder's current state.
These two steps seem to contradict each other, and I can't wrap my head around it:
1) The…
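A sketch of why variable length is not a problem: the scoring network is applied to each encoder state independently with shared weights, so it produces one score per timestep no matter how many there are, and the softmax simply normalizes over that many scores. Sizes and the concatenation-based scorer below are illustrative.

import torch
import torch.nn.functional as F

d = 16
score_layer = torch.nn.Linear(2 * d, 1)        # shared scorer: sees [h_t; s] for one timestep at a time

def attend(encoder_states, decoder_state):
    # encoder_states: (T, d) for any T; decoder_state: (d,)
    T = encoder_states.size(0)
    pairs = torch.cat([encoder_states, decoder_state.expand(T, d)], dim=1)   # (T, 2d)
    scores = score_layer(pairs).squeeze(1)     # one score per timestep, whatever T is
    weights = F.softmax(scores, dim=0)         # normalized over however many scores there are
    return weights @ encoder_states            # (d,) context vector

for T in (3, 7, 50):                           # the same parameters handle any length
    print(attend(torch.randn(T, d), torch.randn(d)).shape)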

Andrew Tu
- 258
- 3
- 8