Questions tagged [self-attention]
57 questions
3
votes
0 answers
How is scaled_dot_product_attention meant to be used with cached keys/values in causal LM?
I'm implementing a transformer and I have everything working, including attention using the new scaled_dot_product_attention from PyTorch 2.0. I'll only be doing causal attention, however, so it seems like it makes sense to use the is_causal=True…

turboderp
- 31
- 2
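A minimal sketch of the usual pattern, assuming a decode step where new tokens attend to cached keys/values (all sizes below are illustrative): is_causal=True describes the square prefill case, so once a KV cache is involved you typically either pass no mask (single-token decoding) or build the rectangular causal mask explicitly.

import torch
import torch.nn.functional as F

B, H, D = 1, 8, 64            # batch, heads, head dim
past_len, new_len = 10, 1     # cached tokens vs. tokens decoded now

q = torch.randn(B, H, new_len, D)
k_cache = torch.randn(B, H, past_len, D)
v_cache = torch.randn(B, H, past_len, D)
k_new = torch.randn(B, H, new_len, D)
v_new = torch.randn(B, H, new_len, D)

# Append the new keys/values to the cache before attending.
k = torch.cat([k_cache, k_new], dim=2)
v = torch.cat([v_cache, v_new], dim=2)

# With a single new query token every cached position is visible,
# so no mask is needed; is_causal=True would assume a square q/k layout.
out = F.scaled_dot_product_attention(q, k, v)

# For multi-token decoding, build the rectangular causal mask by hand
# (True = attend): query i may see keys 0 .. past_len + i.
total_len = past_len + new_len
mask = torch.ones(new_len, total_len).tril(diagonal=past_len).bool()
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)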
3
votes
0 answers
Implementing 1D self attention in PyTorch
I'm trying to implement in PyTorch the 1D self-attention block proposed in the following paper. Below you can find my (provisional) attempt:
import torch.nn as nn
import torch
#INPUT shape ((B), CH, H, W)
class…

James Arten
- 523
- 5
- 16
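The excerpt above is truncated, so here is only a hedged sketch of a generic 1D self-attention module (not the block from the paper), assuming the (B, CH, H, W) input has been flattened to (B, CH, L) with L = H * W and using 1x1 convolutions for the projections:

import torch
import torch.nn as nn

class SelfAttention1D(nn.Module):
    """Generic single-head self-attention over an input of shape (B, CH, L)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv1d(channels, channels, kernel_size=1)
        self.key = nn.Conv1d(channels, channels, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x):                                 # x: (B, CH, L)
        q = self.query(x).transpose(1, 2)                 # (B, L, CH)
        k = self.key(x)                                   # (B, CH, L)
        v = self.value(x).transpose(1, 2)                 # (B, L, CH)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, L, L)
        return (attn @ v).transpose(1, 2)                 # back to (B, CH, L)

x = torch.randn(2, 32, 100)
print(SelfAttention1D(32)(x).shape)                       # torch.Size([2, 32, 100])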
2
votes
1 answer
Have I implemented self-attention correctly in Pytorch?
This is my attempt at implementing self-attention using PyTorch. Have I done anything wrong, or could it be improved somehow?
class SelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super(SelfAttention, self).__init__()
        …

Henry Gordon
- 21
- 2
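For reference, a minimal single-head module of this kind, assuming an input of shape (batch, seq_len, embedding_dim); the asker's class is truncated above, so this is a sketch to compare against, not their code:

import torch
import torch.nn as nn

class MinimalSelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.q_proj = nn.Linear(embedding_dim, embedding_dim)
        self.k_proj = nn.Linear(embedding_dim, embedding_dim)
        self.v_proj = nn.Linear(embedding_dim, embedding_dim)
        self.scale = embedding_dim ** -0.5

    def forward(self, x):                                  # x: (B, T, E)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                    # (B, T, E)

x = torch.randn(4, 10, 64)
print(MinimalSelfAttention(64)(x).shape)                   # torch.Size([4, 10, 64])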
2
votes
0 answers
MultiHeadAttention masking with tensorflow
I have been trying to make a custom mask for targeted combinations of queries and keys for my MultiHeadAttention layer, but cannot figure out how to use this layer's masking.
Here is an example with a dummy dataset (batch size 1):
key =…

Etienne Salimbeni
- 506
- 5
- 7
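A hedged sketch of how a custom mask is typically passed to this layer: attention_mask takes a boolean tensor broadcastable to (batch, num_queries, num_keys), where True means "query i may attend to key j". Shapes and values below are illustrative, not the asker's data:

import numpy as np
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

query = tf.random.normal((1, 4, 16))    # (batch, num_queries, dim)
key = tf.random.normal((1, 6, 16))      # (batch, num_keys, dim)

# Custom mask: allow everything, then block query 0 from keys 3, 4, 5.
mask = np.ones((1, 4, 6), dtype=bool)
mask[0, 0, 3:] = False

out, scores = mha(query, value=key, key=key,
                  attention_mask=tf.constant(mask),
                  return_attention_scores=True)
print(out.shape, scores.shape)          # (1, 4, 16) (1, 2, 4, 6)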
2
votes
1 answer
Keras MultiHeadAttention layer throwing IndexError: tuple index out of range
I'm getting this error over and over again when trying to do self-attention on 1D vectors. I don't really understand why it happens; any help would be greatly appreciated.
layer = layers.MultiHeadAttention(num_heads=2, key_dim=2)
target =…

Fourat Thamri
- 73
- 6
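This error typically means the inputs are rank 2 (or rank 1): MultiHeadAttention expects at least (batch, seq_len, features). A hedged sketch with made-up sizes showing one way to add the missing sequence axis:

import tensorflow as tf
from tensorflow.keras import layers

layer = layers.MultiHeadAttention(num_heads=2, key_dim=2)

# A batch of flat vectors, shape (batch, features): there is no sequence
# axis, which is what triggers the tuple-index error inside the layer.
flat = tf.random.normal((8, 16))

# Reshape each vector into a sequence of smaller feature chunks
# (or use tf.expand_dims(flat, axis=1) to treat it as a length-1 sequence).
target = tf.reshape(flat, (8, 8, 2))     # (batch, seq_len=8, features=2)
source = tf.reshape(flat, (8, 8, 2))

output = layer(target, source)           # self-attention: query and value
print(output.shape)                      # (8, 8, 2)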
2
votes
1 answer
For an image or sequence, what properties do transformers use?
Today my teacher asked me a question: he said that a CNN uses the translation invariance of images or matrices. So what properties does a Transformer use?

qicheng wang
- 43
- 3
2
votes
1 answer
How to implement hierarchical Transformer for document classification in Keras?
A hierarchical attention mechanism for document classification was presented by Yang et al.:
https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
Its implementation is available on…

Rahman
- 410
- 6
- 26
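A hedged sketch of one way to express the hierarchy in Keras, not the exact HAN architecture from the paper (which uses GRU encoders with attention pooling): a word-level encoder wrapped in TimeDistributed, followed by sentence-level self-attention. All sizes are illustrative:

import tensorflow as tf
from tensorflow.keras import layers

max_sents, max_words, vocab, dim = 10, 30, 20000, 64

# Word-level encoder: one sentence of token ids in, one sentence vector out.
word_in = layers.Input(shape=(max_words,))
emb = layers.Embedding(vocab, dim)(word_in)
word_att = layers.MultiHeadAttention(num_heads=4, key_dim=dim // 4)(emb, emb)
sent_vec = layers.GlobalAveragePooling1D()(word_att)
word_encoder = tf.keras.Model(word_in, sent_vec)

# Document-level model: encode every sentence, then let sentences attend
# to each other before classifying the document.
doc_in = layers.Input(shape=(max_sents, max_words))
sent_vecs = layers.TimeDistributed(word_encoder)(doc_in)
doc_att = layers.MultiHeadAttention(num_heads=4, key_dim=dim // 4)(sent_vecs, sent_vecs)
doc_vec = layers.GlobalAveragePooling1D()(doc_att)
out = layers.Dense(5, activation="softmax")(doc_vec)

model = tf.keras.Model(doc_in, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")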
1
vote
0 answers
Customizing the attention mechanism throws an error in TensorFlow: Gradients do not exist for variables
Sorry, I just started learning to build neural networks using TensorFlow.
The TensorFlow version I am using is 2.3.
I want to use a custom attention layer to associate the output of the encoding layer with another input; this is my main…

Ike
- 11
- 1
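That warning usually means some of the layer's variables never end up on the path from the inputs to the loss, e.g. weights created but not used in call(). A hedged sketch of a custom attention layer (illustrative, not the asker's code) that scores an encoder output sequence against a second input and keeps its weight in the computation graph:

import tensorflow as tf

class CustomAttention(tf.keras.layers.Layer):
    """Associates an encoder output sequence with a separate query vector."""
    def build(self, input_shape):
        enc_shape, query_shape = input_shape
        self.W = self.add_weight(name="W",
                                 shape=(enc_shape[-1], query_shape[-1]),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, inputs):
        enc_out, query = inputs                # (B, T, E) and (B, Q)
        # The weight must participate here, otherwise TF reports that
        # gradients do not exist for it.
        scores = tf.einsum("bte,eq,bq->bt", enc_out, self.W, query)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.einsum("bt,bte->be", weights, enc_out)   # context vector

enc = tf.keras.Input(shape=(12, 32))
qry = tf.keras.Input(shape=(16,))
context = CustomAttention()([enc, qry])
model = tf.keras.Model([enc, qry], tf.keras.layers.Dense(1)(context))
model.compile(optimizer="adam", loss="mse")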
1
vote
1 answer
tensorflow 2.10 vs 2.12, same training script, same data, significantly worse training for 2.12
I use this code https://www.kaggle.com/code/ritvik1909/masked-autoencoder-vision-transformer to train a transformer autoencoder. If I run it under TensorFlow 2.10, I obtain far better results than under 2.12. I don't change the…

PMDP3
- 35
- 7
1
vote
1 answer
How to read a BERT attention weight matrix?
I have extracted the attention score/weight matrix from the last layer and the last attention head of my BERT model. However, I am not sure how to read it. The matrix is the following one. I tried to find more information in the…

Chiara
- 372
- 5
- 17
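A hedged sketch of how such a matrix is usually obtained and read with the Hugging Face transformers library (assuming bert-base-uncased): each attention tensor is (batch, heads, seq_len, seq_len); row i gives how much query token i attends to every key token j, and each row sums to 1:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
last_layer = out.attentions[-1]
last_head = last_layer[0, -1]            # (seq, seq): last head of last layer

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for i, t in enumerate(tokens):           # one row per query token
    print(t, last_head[i].numpy().round(2))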
1
vote
0 answers
How to access the value projection of the MultiheadAttention layer in PyTorch
I'm writing my own implementation of the Graphormer architecture. Since this architecture needs to add an edge-based bias to the output of the key-query multiplication in the self-attention mechanism, I am adding that bias by hand and doing the…

Angelo
- 575
- 3
- 18
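If the layer in question is torch.nn.MultiheadAttention, the query/key/value projections are stacked in in_proj_weight (or kept in q_proj_weight / k_proj_weight / v_proj_weight when kdim/vdim differ). A hedged sketch of pulling out the value projection, with illustrative sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# With identical q/k/v dims the projections are stacked as [W_q; W_k; W_v],
# each of shape (embed_dim, embed_dim).
w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
b_q, b_k, b_v = mha.in_proj_bias.chunk(3, dim=0)

x = torch.randn(2, 10, embed_dim)        # (batch, seq, embed)
v = F.linear(x, w_v, b_v)                # the value projection of x
print(v.shape)                           # torch.Size([2, 10, 64])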
1
vote
0 answers
Receptive Field in Swin Transformer
Is it true that the receptive field of the Swin Transformer is limited to the local window in which self-attention is computed? And is there any way to increase the receptive field when using the Swin Transformer?
I know that when we use…

killermama98
- 45
- 5
1
vote
1 answer
Tensorflow Multi Head Attention on Inputs: 4 x 5 x 20 x 64 with attention_axes=2 throwing mask dimension error (tf 2.11.0)
The expectation here is that attention is applied along the 2nd dimension of an input shaped (4, 5, 20, 64). I am trying to apply self-attention using the following code (the issue is reproducible with it):
import numpy as np
import tensorflow as tf
from keras import…

Vidyadhar Mudium
- 77
- 1
- 5
1
vote
1 answer
How to understand the self-attention mask implementation in google transformer tutorial
I am reading Google's transformer tutorial, and the part explaining why the attention_mask for multi-head attention can be built via mask1 & mask2 was unclear to me. Any help would be great!
def call(self, x, training, mask):
    # A boolean mask.
    if…

user1269298
- 717
- 2
- 8
- 26
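A hedged sketch of the broadcasting trick the tutorial relies on: a per-token padding mask of shape (batch, seq_len) is expanded along two different axes, and the logical AND broadcasts into a (batch, seq_len, seq_len) matrix that is True only where both the query position and the key position are real tokens:

import tensorflow as tf

# True = real token, False = padding (illustrative batch of 2 sequences).
mask = tf.constant([[1, 1, 1, 0],
                    [1, 1, 0, 0]], dtype=tf.bool)

mask1 = mask[:, :, tf.newaxis]   # (batch, seq_len, 1) -> query positions
mask2 = mask[:, tf.newaxis, :]   # (batch, 1, seq_len) -> key positions
attention_mask = mask1 & mask2   # (batch, seq_len, seq_len) by broadcasting

print(attention_mask[0].numpy().astype(int))
# [[1 1 1 0]
#  [1 1 1 0]
#  [1 1 1 0]
#  [0 0 0 0]]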
1
vote
0 answers
How to extract graph node embeddings from a Pytorch-Geometric GAT model?
Dataset Structure: Temporal directed graph; nodes have features; edges don't have features; nodes are labelled. Using the Elliptic Dataset.
Task: Classify nodes / predict node labels.
Data Structure: 2 .csv files of nodes and edges.
For the nodes csv…

Fardin Ahsan
- 33
- 7
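A hedged sketch of one common way to get node embeddings out of a PyG GAT model: return the penultimate layer's output from forward (a forward hook works as well). Model layout and sizes below are illustrative, not the asker's:

import torch
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.conv1 = GATConv(in_dim, hid_dim, heads=4, concat=True)
        self.conv2 = GATConv(hid_dim * 4, num_classes, heads=1, concat=False)

    def forward(self, x, edge_index, return_embeddings=False):
        h = torch.relu(self.conv1(x, edge_index))    # node embeddings
        out = self.conv2(h, edge_index)               # class logits
        return (out, h) if return_embeddings else out

x = torch.randn(100, 165)                     # 100 nodes, 165 features each
edge_index = torch.randint(0, 100, (2, 400))  # random directed edges
model = GAT(165, 32, 2)

logits, embeddings = model(x, edge_index, return_embeddings=True)
print(embeddings.shape)                       # torch.Size([100, 128])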