
What's the right way to return a limited number of layers using the longformer API?

Unlike this case in basic BERT, it's not clear to me from the return type how to get only the last N layers.

So, I run this:

from transformers import LongformerTokenizer, LongformerModel

text = "word " * 4096 # long document!

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input)

And I get dimensions like so from my return:

>>> output[0].shape
torch.Size([1, 4096, 768])

>>> output[1].shape
torch.Size([1, 768])

You can see the second dimension of output[0] matches my token count. I believe that slicing this would just give me fewer tokens, not the last N layers.
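As a quick check (reusing the output object from above), slicing along that middle dimension just trims tokens:

>>> output[0][:, -3:, :].shape
torch.Size([1, 3, 768])

So that slice gives me the last 3 tokens, not the last 3 layers.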

Update from answer below

Even when asking for output_hidden_states, the dimensions still look off, and it's not clear to me how to reduce these to a vector-sized, 1-D embedding. Here's what I mean:

encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input, output_hidden_states=True)

Ok, now let's look into output[2], the third item of the tuple:

>>> len(output[2])
13

Suppose we want to see the last 3 of the 13 layers:

>>> [layer[0].shape for layer in output[2][-3:]]
[torch.Size([4096, 768]), torch.Size([4096, 768]), torch.Size([4096, 768])]

So we see each of the 13 layers (after dropping the batch dimension with [0]) is shaped (4096 x 768), and they look like:

>>> [layer[0] for layer in output[2][-3:]]
[tensor([[-0.1494,  0.0190,  0.0389,  ..., -0.0470,  0.0259,  0.0609],

We still have a dimension of 4096, which corresponds to my token count:

>>> import numpy as np
>>> np.mean(np.stack([layer[0].detach().numpy() for layer in output[2][-3:]]), axis=0).shape
(4096, 768)

Averaging these together does not seem like it would give a valid embedding (for comparisons like cosine similarity).
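To at least collapse things down to a single 768-dimensional vector, one option is to mean-pool over the token axis (masking out padding) after averaging the layers. This is only a sketch, reusing output and encoded_input from above, and whether the result is a meaningful embedding for cosine similarity is exactly the open question here:

import torch

# Average the last 3 hidden layers, then mean-pool over tokens (ignoring padding)
stacked = torch.stack(output[2][-3:], dim=0)                # [3, 1, 4096, 768]
layer_avg = stacked.mean(dim=0)                             # [1, 4096, 768]
mask = encoded_input["attention_mask"].unsqueeze(-1)        # [1, 4096, 1]
doc_vec = (layer_avg * mask).sum(dim=1) / mask.sum(dim=1)   # [1, 768]
print(doc_vec.squeeze(0).shape)                             # torch.Size([768])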

Mittenchops
  • "If you need to reduce the output size you'll need to use a pooling layer": https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/ – slim Jul 08 '22 at 16:13

2 Answers


output is a tuple consisting of two elements:

  1. sequence_output (i.e. last encoder block)
  2. pooled_output

In order to obtain all hidden layers, you need to set the parameter output_hidden_states to True:

output = model(**encoded_input, output_hidden_states=True)

The output now has 3 elements, and the third element contains the output of the embedding layer and of each encoder layer.
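For example (a sketch reusing the question's model and input; in recent transformers versions the same tuple is also available as output.hidden_states), the last N layers can be sliced from that third element directly:

hidden_states = output[2]             # tuple of 13 tensors: embedding output + 12 encoder layers
last_3 = hidden_states[-3:]           # the last 3 encoder layers
print(len(hidden_states), last_3[0].shape)   # 13, torch.Size([1, 4096, 768])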

cronoik
  • Thanks, that is helpful, but I think I am still confused about the step after that, since the result is still the same suspicious size as my token count. – Mittenchops Oct 06 '20 at 16:34
  • @Mittenchops Please don't modify your question so much that my answer becomes useless. Open a new question instead. In general, I think your approach is wrong. Not sure about Longformer, but [BERT does not produce meaningful sentence representations](https://github.com/google-research/bert/issues/164#issuecomment-441324222) that could be used for cosine similarity or something like that, because it requires all dimensions to have the same scale. I think this also applies to Longformer. You can try [sentence-transformers](https://github.com/UKPLab/sentence-transformers) instead. – cronoik Oct 06 '20 at 21:47
  • Thanks, @cronoik. I did find your answer helpful to see that there was a third tuple element I could request, but there is still a gap between that and getting valid embeddings, so it seemed appropriate to update it. I've certainly had questions closed for being /too similar/ to a previous one when I posted an update as a new question. I have been encouraged by moderators to make updates for clarity. – Mittenchops Oct 06 '20 at 22:05
  • I also understand that it is quite common to take the terminal layers of BERT embeddings for use in applications like similarity: https://stackoverflow.com/a/63464865/1052117 and the BERTology paper it references: https://arxiv.org/pdf/2002.12327.pdf – Mittenchops Oct 06 '20 at 22:09
  • @Mittenchops I have posted my answer as an answer to your other question. – cronoik Oct 07 '20 at 04:52
# Reusing model and encoded_input from the question:
outputs = model(**encoded_input, output_hidden_states=True)

print(outputs.keys())
# odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

print("outputs[0] gives us the sequence output:", outputs[0].shape)    # torch.Size([1, 4096, 768])
print("outputs[1] gives us the pooled output:", outputs[1].shape)      # torch.Size([1, 768])
print("outputs[2] gives us the hidden states:", outputs[2][0].shape)   # torch.Size([1, 4096, 768])

For your use case you can use outputs[1] (the pooled output) as embeddings.
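For instance, continuing from the snippet above, you can squeeze out the batch dimension to get a 1-D vector and compare documents with cosine similarity (a sketch; see the comments above about how meaningful such vectors are):

import torch.nn.functional as F

doc_vec = outputs[1].squeeze(0)   # pooler_output -> torch.Size([768])
# other_doc_vec would be the same kind of vector computed for a second document:
# similarity = F.cosine_similarity(doc_vec, other_doc_vec, dim=0)
print(doc_vec.shape)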

MAC