
How do I get an embedding for the whole sentence from Hugging Face's feature-extraction pipeline?

I understand how to get the features for each token (below), but how do I get the overall features for the sentence as a whole?

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
stackoverflowuser2010
user3472360

3 Answers


To expand on the comment I left under stackoverflowuser2010's answer, I will use "bare-bones" models, but the behavior is the same with the pipeline component.

BERT and derived models (including DistilRoberta, which is the model you are using in the pipeline) generally indicate the start and end of a sentence with special tokens (the first one mostly denoted as [CLS]), which usually are the easiest way of making predictions/generating embeddings over the entire sequence. There is a discussion within the community about which method is superior (see also a more detailed answer by stackoverflowuser2010 here); however, if you simply want a "quick" solution, then taking the [CLS] token is certainly a valid strategy.
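
As a minimal sketch of this "bare-bones" strategy (assuming torch and a recent transformers version are installed), you would call the model directly and take the hidden state of the first token:

import torch
from transformers import AutoTokenizer, AutoModel

# load the same checkpoint that the pipeline uses
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModel.from_pretrained("distilroberta-base")

inputs = tokenizer("i am sentence", return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# [0, 0] selects the first token (RoBERTa's <s>, the analogue of BERT's [CLS])
cls_embedding = output.last_hidden_state[0, 0]
print(cls_embedding.shape)  # torch.Size([768])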

Now, while the documentation of the FeatureExtractionPipeline isn't very clear, in your example we can easily compare the outputs, specifically their lengths, with a direct tokenizer call:

from transformers import pipeline, AutoTokenizer

# direct encoding of the sample sentence
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")

# your approach
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")

# Compare lengths of outputs
print(len(encoded_seq)) # 5
# Note: the pipeline returns a nested list (one entry per input sequence), so we index with [0].
print(len(features[0])) # 5

When inspecting the content of encoded_seq, you will notice that the first token has ID 0, which denotes the beginning-of-sequence token (RoBERTa's <s>, the equivalent of BERT's [CLS]). Since the output lengths are the same, you can then simply access a preliminary sentence embedding by doing something like

sentence_embedding = features[0][0]
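
As a quick sanity check (distilroberta-base's hidden size is 768), this is a single D-dimensional vector:

print(len(sentence_embedding))  # 768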

dennlinger
  • Note that the embedding for the `[CLS]` token will be gibberish unless you fine-tune the model on a downstream task. I hypothesize that if you pool over the token embeddings as I suggested in my answer, then the resulting sentence embedding will have meaning without additional fine-tuning. The reason I bring this notion up is that the out-of-the-box `pipeline` classes do not provide an API for fine-tuning the underlying model. – stackoverflowuser2010 Nov 08 '20 at 21:13
  • Makes sense, also a good point! Would you suggest to only pool over actual tokens (since `[CLS]` and `[EOS]` are also tokens in the sample input on their own), or include the entire sequence regardless? – dennlinger Nov 09 '20 at 09:11
  • I would train on a downstream task to get good sentence embeddings. Using the NLI task seems to be the current best practice for doing so. I've been getting good empirical results by pooling over all the tokens, including subtokens (`##foo`) and special tokens (`[CLS]`, `[SEP]`), although I'd like to explore alternatives in the future. – stackoverflowuser2010 Nov 09 '20 at 19:23
  • I can't edit it because the edit is less than 6 characters, but the `AutoTokenizer` has an invalid base name; it should be 'distilroberta-base'. –  Dec 30 '20 at 01:45
  • Hey everyone, I am new to this field and would like to ask: do special tokens (such as the sequence start/end tokens) have word embeddings? Let's say in models like BERT. – Cosq Jun 05 '22 at 14:06

If you want a meaningful embedding of the whole sentence, please use SentenceTransformers. Pooling is well implemented in the library, and it also provides various APIs to fine-tune models to produce features/embeddings at the sentence/text-chunk level.

pip install sentence-transformers

Once you have installed sentence-transformers, the code below can be used to produce sentence embeddings:

from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('distilroberta-base')
embeddings = model_st.encode('I am a sentence')
print(embeddings)
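
As a small usage sketch (assuming sentence-transformers 2.x; the second sentence here is just an illustrative example), the resulting embeddings can be compared directly, e.g. with cosine similarity:

from sentence_transformers import SentenceTransformer, util

model_st = SentenceTransformer('distilroberta-base')

# encode a small batch of sentences at once; each row is one sentence embedding
embeddings = model_st.encode(['I am a sentence', 'I am another sentence'])
print(embeddings.shape)  # (2, 768)

# cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))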

Visit the official site for more info on Sentence Transformers.

Davide Fiocco
YoungSheldon
  • Both sentence-transformers and pipeline provide identical embeddings, only that if you are using pipeline and you want a single embedding for the entire sentence, you need to take the mean over the token embeddings. So from dennlinger's answer above (that uses the pipeline function), do `np.mean(features[0], axis=0)`. You will get a vector identical to `embeddings` from YoungSheldon's answer (that relies upon sentence-transformers). – Topchi Nov 11 '22 at 01:53
  • Minor correction to Topchi's input: not all models use mean pooling. Some models may have been trained with TSDAE or SimCSE or another architecture; in that case, just taking the mean of the features will not work. Only if the model you are loading ('distilroberta-base' in the given example) is a generic transformer model does Topchi's comment hold true. – YoungSheldon Nov 14 '22 at 18:32
  • That is correct, I verified only for distilroberta-base, not other models, and YoungSheldon's observation is accurate – Topchi Nov 15 '22 at 22:47

If you have the embeddings for each token, you can create an overall sentence embedding by pooling (summarizing) over them. Note that if you have D-dimensional token embeddings, you should get a D-dimensional sentence embedding through one of these approaches (a short sketch follows the list):

  1. Compute the mean over all token embeddings.

  2. Compute the max of each of the D-dimensions over all the token embeddings.
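
A minimal sketch of both pooling strategies, assuming numpy is available and reusing the pipeline call from the question:

import numpy as np
from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
# the pipeline returns a nested list; [0] selects the first (and only) sequence
token_embeddings = np.array(feature_extraction("i am sentence")[0])  # shape: (num_tokens, D)

mean_pooled = token_embeddings.mean(axis=0)  # 1. mean over all token embeddings -> (D,)
max_pooled = token_embeddings.max(axis=0)    # 2. per-dimension max over all tokens -> (D,)
print(mean_pooled.shape, max_pooled.shape)   # (768,) (768,)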

stackoverflowuser2010
  • I don't think this is necessarily the (only) correct approach. Ideally, you can simply use the embedding of the `[CLS]` token, which should act as an embedding of the whole sequence. I'll try to post an answer showing how to access this via the `pipeline` feature. While I agree that averaging is also a valid approach, especially when pre-training or fine-tuning, most approaches out there utilize the `[CLS]` token and not the general token embeddings. – dennlinger Nov 06 '20 at 11:13