I have a collection of relatively long texts, roughly 2k tokens each, and I want to convert each one into an embedding.
I found that sentence-transformers is quite popular, but it can only handle fairly short sequences. One approach is to create an embedding for each sentence and then average the results, but I don't want to do that; I'm interested in getting an embedding without any pooling operation. I also found Huggingface's feature-extraction pipeline, but from here I understand that it also applies some pooling over the sequence (unless I'm mistaken).
For example, let's say I want to use a GPT2 model (or some other model that can take long sequences into account):
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Some very very very long text"