Questions tagged [self-attention]

57 questions
3
votes
0 answers

How is scaled_dot_product_attention meant to be used with cached keys/values in causal LM?

I'm implementing a transformer and I have everything working, including attention using the new scaled_dot_product_attention from PyTorch 2.0. I'll only be doing causal attention, however, so it seems like it makes sense to use the is_causal=True…
turboderp
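
A minimal sketch of the usual pattern (not necessarily what the asker ended up with): use is_causal=True for the prefill pass over the full prompt, then drop it during incremental decoding, since a single-token query may attend to every cached position and is_causal would mis-align the mask once query and key lengths differ.

import torch
import torch.nn.functional as F

# hypothetical shapes: (batch, heads, seq, head_dim)
B, H, D = 1, 4, 64

# prefill: full prompt, causal mask applies
q = k = v = torch.randn(B, H, 10, D)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# cache keys/values
k_cache, v_cache = k, v

# decode step: one new token; it may attend to all cached positions, so no
# causal mask is needed (is_causal=True assumes query and key lengths match)
q_new = torch.randn(B, H, 1, D)
k_new = torch.randn(B, H, 1, D)
v_new = torch.randn(B, H, 1, D)
k_cache = torch.cat([k_cache, k_new], dim=2)
v_cache = torch.cat([v_cache, v_new], dim=2)
out_new = F.scaled_dot_product_attention(q_new, k_cache, v_cache, is_causal=False)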
3
votes
0 answers

Implementing 1D self attention in PyTorch

I'm trying to implement in PyTorch the 1D self-attention block proposed in the following paper. Below you can find my (provisional) attempt: import torch.nn as nn import torch #INPUT shape ((B), CH, H, W) class…
James Arten
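
For reference, a minimal 1D self-attention block over a (B, C, L) tensor, using 1x1 convolutions for the query/key/value projections; the exact block in the paper the asker cites may differ.

import torch
import torch.nn as nn

class SelfAttention1d(nn.Module):
    # attention over the length dimension of a (B, C, L) tensor
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv1d(channels, channels, kernel_size=1)
        self.key = nn.Conv1d(channels, channels, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x):                      # x: (B, C, L)
        q = self.query(x).transpose(1, 2)      # (B, L, C)
        k = self.key(x)                        # (B, C, L)
        v = self.value(x).transpose(1, 2)      # (B, L, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, L, L)
        return (attn @ v).transpose(1, 2)      # (B, C, L)

x = torch.randn(2, 16, 100)
print(SelfAttention1d(16)(x).shape)            # torch.Size([2, 16, 100])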
2
votes
1 answer

Have I implemented self-attention correctly in Pytorch?

This is my attempt at implementing self-attention using PyTorch. Have I done anything wrong, or could it be improved somehow? class SelfAttention(nn.Module): def __init__(self, embedding_dim): super(SelfAttention, self).__init__() …
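
One way to answer this kind of question is to compare the hand-rolled attention numerically against PyTorch's built-in reference; a minimal sanity-check sketch:

import torch
import torch.nn.functional as F

# compare manual scaled dot-product attention with PyTorch's implementation
B, T, E = 2, 5, 8
q, k, v = torch.randn(B, T, E), torch.randn(B, T, E), torch.randn(B, T, E)

manual = torch.softmax(q @ k.transpose(-2, -1) / E ** 0.5, dim=-1) @ v
builtin = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(manual, builtin, atol=1e-5))  # True (up to numerical tolerance)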
2
votes
0 answers

MultiHeadAttention masking with tensorflow

I have been trying to make a custom mask for targeted combinations of queries and keys for my MultiHeadAttention layer, but I cannot figure out how to use this layer's masking. Here is an example with a dummy dataset (batch size 1): key =…
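
A minimal sketch of how Keras's MultiHeadAttention consumes a per-pair mask: attention_mask has shape (batch, num_queries, num_keys), and a False entry blocks that query-key combination (all shapes here are made up).

import numpy as np
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

# batch of 1, 4 query positions, 6 key/value positions, feature size 16
query = tf.random.normal((1, 4, 16))
value = tf.random.normal((1, 6, 16))

# attention_mask: (batch, num_queries, num_keys); False blocks that pair
mask = np.ones((1, 4, 6), dtype=bool)
mask[0, 0, 5] = False      # block attention from query position 0 to key position 5

out = mha(query, value, attention_mask=mask)
print(out.shape)           # (1, 4, 16)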
2
votes
1 answer

Keras MultiHeadAttention layer throwing IndexError: tuple index out of range

I'm getting this error over and over again when trying to do self-attention on 1D vectors. I don't really understand why it happens; any help would be greatly appreciated. layer = layers.MultiHeadAttention(num_heads=2, key_dim=2) target =…
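
The usual cause of that IndexError is that MultiHeadAttention expects at least a (batch, sequence, features) tensor; a minimal sketch that reshapes a 1-D vector accordingly:

import tensorflow as tf

layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=2)

# a 1-D vector has no batch or sequence axis, which is what typically triggers
# "tuple index out of range"; add both axes before calling the layer
vec = tf.random.normal((8,))
target = tf.reshape(vec, (1, 8, 1))   # (batch=1, seq=8, features=1)

out = layer(target, target)           # self-attention
print(out.shape)                      # (1, 8, 1)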
2
votes
1 answer

For an image or sequence, what are the properties transformers use?

Today my teacher asked me a question: he said that CNNs exploit the translation invariance of images or matrices. So what are the properties Transformers use?
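
The property usually cited is permutation equivariance: without positional encodings, self-attention has no built-in notion of order or locality, which is why positional information has to be added explicitly. A small demonstration (plain scaled dot-product attention, no learned projections):

import torch
import torch.nn.functional as F

# permuting the input tokens just permutes the output in the same way
torch.manual_seed(0)
x = torch.randn(1, 5, 8)                  # (batch, tokens, features)
perm = torch.randperm(5)

def attend(t):
    return F.scaled_dot_product_attention(t, t, t)

out = attend(x)
out_perm = attend(x[:, perm, :])
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-6))  # True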
2
votes
1 answer

How to implement hierarchical Transformer for document classification in Keras?

A hierarchical attention mechanism for document classification was presented by Yang et al.: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf Its implementation is available on…
Rahman
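
A minimal Keras sketch of the hierarchical idea, with the paper's GRU-plus-attention encoders replaced by MultiHeadAttention followed by average pooling (all sizes are made up): a word-level encoder turns each sentence into a vector, and a sentence-level encoder attends over those vectors to classify the document.

import tensorflow as tf
from tensorflow.keras import layers

MAX_SENTS, MAX_WORDS, VOCAB, EMB = 10, 30, 20000, 128

# word-level encoder: embeds one sentence and pools it into a single vector
words_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
w = layers.Embedding(VOCAB, EMB)(words_in)
w = layers.MultiHeadAttention(num_heads=4, key_dim=32)(w, w)
sent_vec = layers.GlobalAveragePooling1D()(w)
word_encoder = tf.keras.Model(words_in, sent_vec)

# document level: apply the word encoder to every sentence, then attend over sentences
doc_in = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
s = layers.TimeDistributed(word_encoder)(doc_in)
s = layers.MultiHeadAttention(num_heads=4, key_dim=32)(s, s)
doc_vec = layers.GlobalAveragePooling1D()(s)
out = layers.Dense(5, activation="softmax")(doc_vec)

model = tf.keras.Model(doc_in, out)
model.summary()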
1
vote
0 answers

Customizing the attention mechanism raises an error in TensorFlow: Gradients do not exist for variables

Sorry, I just started learning to build neural networks using tensorflow. The tensorflow version I am using is 2.3. I want to use a custom attention layer to associate the output of the encoding layer with another input; this is my main…
Ike
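
That warning usually means a weight created by the layer never participates in the computation that produces the loss (for example, it is created in build but not used in call, or the forward pass goes through non-TensorFlow ops). A minimal custom attention-pooling layer whose weight is actually used in call:

import tensorflow as tf

class AttentionPooling(tf.keras.layers.Layer):
    # scores each timestep against a learned vector and returns a weighted sum
    def build(self, input_shape):
        self.w = self.add_weight(name="w", shape=(input_shape[-1],),
                                 initializer="glorot_uniform")

    def call(self, inputs):                                    # (batch, time, features)
        scores = tf.reduce_sum(inputs * self.w, axis=-1)       # (batch, time)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.reduce_sum(inputs * weights[..., tf.newaxis], axis=1)

x = tf.random.normal((2, 7, 16))
print(AttentionPooling()(x).shape)    # (2, 16)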
1
vote
1 answer

tensorflow 2.10 vs 2.12, same training script, same data, significantly worse training for 2.12

I use this code https://www.kaggle.com/code/ritvik1909/masked-autoencoder-vision-transformer to train a transformer autoencoder. If I run the code under tensorflow 2.10, I obtain way better results than under 2.12. I don't change the…
1
vote
1 answer

How to read a BERT attention weight matrix?

I have extracted the attention score/weight matrix from the last layer and the last attention head of my BERT model. However, I am not sure how to read it. The matrix is the following one. I tried to find some more information in the…
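
A minimal sketch of how such a matrix is usually read, using the Hugging Face transformers API: row i holds the softmax-normalised weights with which token i (the query) attends to every token j (the keys), so each row sums to 1.

import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
last_head = out.attentions[-1][0, -1]   # last layer, last head
print(last_head.sum(dim=-1))            # ~1.0 for every row (query position)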
1
vote
0 answers

How to access the value projection at MultiHeadAttention layer in Pytorch

I'm writing my own implementation of the Graphormer architecture. Since this architecture needs to add an edge-based bias to the output of the key-query multiplication in the self-attention mechanism, I am adding that bias by hand and doing the…
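
When query, key and value share the same embedding dimension, nn.MultiheadAttention packs the three input projections into a single in_proj_weight of shape (3*embed_dim, embed_dim); the value projection is the last third. A minimal sketch of pulling it out and applying it by hand:

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_heads = 64, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# split the packed projection into query, key and value parts
w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
b_q, b_k, b_v = mha.in_proj_bias.chunk(3, dim=0)

x = torch.randn(2, 10, embed_dim)
v_proj = F.linear(x, w_v, b_v)          # the value projection applied manually
print(v_proj.shape)                      # torch.Size([2, 10, 64])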
1
vote
0 answers

Receptive Field in Swin Transformer

I want to ask whether it is true that the receptive field of the Swin Transformer is just the local window where we compute the self-attention. And is there any way to increase the receptive field when using the Swin Transformer? I know that when we use…
1
vote
1 answer

Tensorflow Multi Head Attention on Inputs: 4 x 5 x 20 x 64 with attention_axes=2 throwing mask dimension error (tf 2.11.0)

The expectation here is that attention is applied on the 2nd dimension of (4, 5, 20, 64). I am trying to apply self-attention using the following code (issue reproducible with this code): import numpy as np import tensorflow as tf from keras import…
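
For reference, the unmasked case works as follows: attention_axes picks the axes attention is computed over, while the batch axis and the final feature axis are left alone. A minimal sketch (the question's error concerns the mask shape, which must match the resulting attention-score dimensions):

import tensorflow as tf

layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16, attention_axes=2)
x = tf.random.normal((4, 5, 20, 64))   # attention runs along axis 2 (length 20)
out = layer(x, x)                       # self-attention, no mask
print(out.shape)                        # (4, 5, 20, 64)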
1
vote
1 answer

How to understand the self-attention mask implementation in google transformer tutorial

I am reading Google's transformer tutorial, and the part explaining why the attention_mask for multi-head attention can be built via mask1 & mask2 was unclear to me. Any help would be great! def call(self, x, training, mask): # A boolean mask. if…
user1269298
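
The broadcast AND of the padding mask with itself yields a (batch, seq, seq) matrix whose (i, j) entry is True only when both token i and token j are real, which is exactly the pairwise mask MultiHeadAttention expects. A small demonstration:

import tensorflow as tf

# padding mask for one sequence of length 5 whose last two positions are padding
mask = tf.constant([[True, True, True, False, False]])

mask1 = mask[:, :, tf.newaxis]   # (batch, seq, 1): is the query position real?
mask2 = mask[:, tf.newaxis, :]   # (batch, 1, seq): is the key position real?
attention_mask = mask1 & mask2   # (batch, seq, seq): True only if both are real

print(attention_mask[0].numpy().astype(int))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [0 0 0 0 0]
#  [0 0 0 0 0]]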
1
vote
0 answers

How to extract graph node embeddings from a Pytorch-Geometric GAT model?

Dataset Structure: Temporal directed graph; Nodes have features; Edges don't have features; Nodes are labelled. Using the Elliptic Dataset. Task: Classify nodes / predict node labels. Data Structure: 2 .csv files of nodes and edges. For the nodes csv…
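
A common pattern is to expose the hidden representation just before the classification head and call it after training; a minimal PyTorch Geometric sketch (layer sizes are made up):

import torch
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    # the output of the last GATConv is used as the node embedding
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.conv1 = GATConv(in_dim, hid_dim, heads=4, concat=True)
        self.conv2 = GATConv(4 * hid_dim, hid_dim, heads=1, concat=False)
        self.classifier = torch.nn.Linear(hid_dim, num_classes)

    def embed(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)           # (num_nodes, hid_dim)

    def forward(self, x, edge_index):
        return self.classifier(self.embed(x, edge_index))

# usage: after training, model.embed(data.x, data.edge_index) returns one
# embedding vector per node, which can be saved or fed to another model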