I have a collection of relatively long texts, each containing roughly 2k tokens. I want to convert each one into an embedding.

I found that sentence-transformers is quite popular, but it can only handle fairly short sequences. One approach is to create an embedding for each sentence and then average the results (sketched below), but I don't want to do that. I'm interested in getting an embedding without any pooling operations. I also found Huggingface's feature extraction pipeline, but from here I understand that it also applies some pooling operation over the sequence (unless I'm mistaken).
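To make it concrete, this is roughly the sentence-averaging approach I'd like to avoid (just a sketch; the model name and the naive sentence splitting are only placeholders):

from sentence_transformers import SentenceTransformer
import numpy as np

text = "Some very very very long text"               # stand-in for one of my ~2k-token documents
model = SentenceTransformer('all-MiniLM-L6-v2')       # example model that truncates long inputs
sentence_embeddings = model.encode(text.split('. '))  # one vector per (naively split) sentence
doc_embedding = np.mean(sentence_embeddings, axis=0)  # this averaging/pooling step is what I want to avoid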

For example, let's say I want to use a GPT2 model (or some other model that can take long sequences into account):

from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Some very very very long text"
Penguin
  • I do not understand how your GPT2 code snippet relates to the question - there are no embeddings there, neither in input nor in output. – MrTJ Jul 28 '23 at 07:43
  • @MrTJ This is just an example of a model that I thought could be potentially used for its long context length. I may be wrong though, it's just the best I could think of – Penguin Jul 28 '23 at 12:16
  • @Ilya Because I specifically said in the question "without pooling", and you completely ignored that and your answer contains the mean pooling operation: "i use mean pooling operation" – Penguin Jul 28 '23 at 13:29

0 Answers