What's the right way to return a limited number of layers using the Longformer API?
Unlike the basic BERT case, it's not clear to me from the return type how to get only the last N layers.
So, I run this:
from transformers import LongformerTokenizer, LongformerModel
text = "word " * 4096 # long document!
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input)
And I get dimensions like so from my return:
>>> output[0].shape
torch.Size([1, 4096, 768])
>>> output[1].shape
torch.Size([1, 768])
You can see the shape of output[0] tracks my token count (4096). I believe slicing it would just give me fewer tokens, not the last N layers.
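To double-check that intuition, here's a quick sketch with random NumPy arrays standing in for the two return values (same shapes as above, names hypothetical); slicing the middle axis of output[0] selects tokens, not layers:

```python
import numpy as np

# Stand-ins for the Longformer outputs above (random data, same shapes):
# output[0] is (batch, tokens, hidden); output[1] is (batch, hidden).
last_hidden_state = np.random.randn(1, 4096, 768)
pooler_output = np.random.randn(1, 768)

# Slicing axis 1 of output[0] keeps the last 3 *tokens*, not layers:
assert last_hidden_state[:, -3:, :].shape == (1, 3, 768)
```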
Update from answer below
Even after asking for output_hidden_states, the dimensions still look off, and it's not clear to me how to reduce them to a single 1-d embedding vector. Here's what I mean:
encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input, output_hidden_states=True)
OK, now let's look at output[2], the third item of the tuple:
>>> len(output[2])
13
Suppose we want to see the last 3 of the 13 layers (the embedding output plus one hidden state per each of the 12 transformer layers):
>>> [layer[0].shape for layer in output[2][-3:]]
[torch.Size([4096, 768]), torch.Size([4096, 768]), torch.Size([4096, 768])]
So each element of output[2] is a (1, 4096, 768) tensor; indexing [0] drops the batch dimension, leaving (4096, 768). They look like:
>>> [layer[0] for layer in output[2][-3:]]
[tensor([[-0.1494, 0.0190, 0.0389, ..., -0.0470, 0.0259, 0.0609],
The first axis still has size 4096, one row per token:
>>> np.mean(np.stack([layer[0].detach().numpy() for layer in output[2][-3:]]), axis=0).shape
(4096, 768)
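Sketching the same reduction with random arrays (shapes borrowed from above, names hypothetical): averaging across the layer axis collapses the 3 layers but still leaves one 768-d row per token, which is why the 4096 axis survives:

```python
import numpy as np

# Random stand-ins for the last 3 hidden-state tensors, each (4096, 768)
# after dropping the batch dimension as above.
layers = [np.random.randn(4096, 768) for _ in range(3)]

stacked = np.stack(layers)        # (3, 4096, 768): layer x token x hidden
layer_avg = stacked.mean(axis=0)  # (4096, 768): still one row per token
assert layer_avg.shape == (4096, 768)
```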
Averaging the layers together still leaves a (4096, 768) array, one vector per token, which does not seem like a valid document embedding for comparisons like cosine similarity.
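For what it's worth, one common recipe I've seen (not from the Longformer docs, so treat it as an assumption) is to mean-pool the last hidden layer over the token axis, weighted by the attention mask, which does yield a single 768-d vector. A sketch with random stand-ins for the real tensors:

```python
import numpy as np

# Random stand-ins: last_layer mirrors output[2][-1] (batch, tokens, hidden);
# attention_mask mirrors encoded_input["attention_mask"] (batch, tokens).
last_layer = np.random.randn(1, 4096, 768)
attention_mask = np.ones((1, 4096))

mask = attention_mask[..., None]          # (1, 4096, 1) for broadcasting
summed = (last_layer * mask).sum(axis=1)  # (1, 768): sum over real tokens
counts = mask.sum(axis=1)                 # (1, 1): number of real tokens
embedding = (summed / counts)[0]          # (768,): a single 1-d vector
assert embedding.shape == (768,)
```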