
This question relates to the new paper Big Bird: Transformers for Longer Sequences, specifically the implementation of the sparse attention described in the supplemental material, part D. I am currently trying to implement it in PyTorch.

They suggest a new way to speed up the computation by blocking the original query and key matrices (see below): attention with block matrices.

When you do the matrix multiplication in step (b), you end up with something like this: diagonal attention.
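A minimal sketch of that blocked multiplication, assuming a toy sequence length `n`, head dimension `d`, and block size `b` (placeholder values, not the paper's actual hyperparameters):

```python
import torch

# Hypothetical sizes: sequence length n split into n // b blocks of size b
n, d, b = 512, 64, 64
Q = torch.randn(n, d)
K = torch.randn(n, d)

# Reshape into (num_blocks, block_size, dim) so each query block only
# attends to the matching key block (step (b): the diagonal blocks)
Qb = Q.view(n // b, b, d)
Kb = K.view(n // b, b, d)

# Batched matmul gives one (b, b) attention block per diagonal position
diag_scores = torch.bmm(Qb, Kb.transpose(1, 2))  # shape: (n // b, b, b)
```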

So I was wondering: how would you go from that representation (the image above) to a sparse matrix (using PyTorch, see below)? In the paper, they just say "simply reshape the result", and I do not know of any easy way to do so, especially when I have multiple blocks in different positions (see step (c) in the first image).

matrix with diagonal attention

RESOLUTION: Hugging Face has an implementation of BigBird in PyTorch.

1 Answer


I ended up following the guidelines in the paper. When it comes to unpacking the result, I use torch.sparse_coo_tensor.
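For reference, a minimal sketch of how the per-block scores could be scattered into a torch.sparse_coo_tensor; the sizes and the `diag_scores` tensor here are placeholders, and this only covers the diagonal blocks, not the extra global/random blocks:

```python
import torch

n, b = 512, 64
num_blocks = n // b
diag_scores = torch.randn(num_blocks, b, b)  # stand-in for the blocked attention scores

# Build COO indices for every entry of every diagonal block,
# in the same row-major order that reshape(-1) flattens the values
block_ids = torch.arange(num_blocks).repeat_interleave(b * b)
local = torch.arange(b)
rows_in_block = local.repeat_interleave(b).repeat(num_blocks)
cols_in_block = local.repeat(b).repeat(num_blocks)
rows = block_ids * b + rows_in_block
cols = block_ids * b + cols_in_block

indices = torch.stack([rows, cols])  # shape: (2, num_blocks * b * b)
values = diag_scores.reshape(-1)     # block values flattened in matching order

sparse_attn = torch.sparse_coo_tensor(indices, values, size=(n, n))
```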

EDIT: Sparse tensors are still memory-hungry! A more efficient solution is described here.