I used the XLM-RoBERTa tokenizer to get the IDs for a batch of sentences such as:
["loving is great", "This is another example"]
I see that the number of IDs returned does not always match the number of whitespace-separated tokens in my sentences: for example, the first sentence corresponds to `[[0, 459, 6496, 83, 6782, 2]]`, with `loving` being split into the two IDs 459 and 6496. After getting the matrix of word embeddings from the IDs, I was trying to identify only those embedding vectors that correspond to some specific tokens: is there a way to do that? Since an original token can be assigned more than one ID, and this cannot be predicted in advance, I do not see how this is possible.
More generally, my task is to get word embeddings for some specific tokens within a sentence. My goal is to encode the whole sentence first, so that each token's embedding is computed within its syntactic context, but then to identify and keep the vectors of only some specific tokens, rather than those of all tokens in the sentence.
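To make the problem concrete, here is a minimal sketch (plain Python/NumPy, with the tokenizer output hard-coded from the example above) of the selection I would like to do. The `word_ids` list here is hypothetical: it is the sub-token-position-to-word mapping I would need but do not know how to obtain reliably, and the embedding matrix is random just to stand in for the model output.

```python
import numpy as np

# Hard-coded from the example above: "loving is great" -> 6 sub-token IDs
input_ids = [0, 459, 6496, 83, 6782, 2]   # <s> lov ing is great </s>

# Hypothetical mapping from sub-token position to word index
# (None for special tokens); "loving" would span positions 1-2.
word_ids = [None, 0, 0, 1, 2, None]

# Stand-in embeddings: one 4-dim vector per sub-token position
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(input_ids), 4))

# Select (and, e.g., average) the vectors belonging to word 0 ("loving")
positions = [i for i, w in enumerate(word_ids) if w == 0]
loving_vec = embeddings[positions].mean(axis=0)

print(positions)         # [1, 2]
print(loving_vec.shape)  # (4,)
```

If such a `word_ids` mapping (or something equivalent) can be produced by the tokenizer itself, that would solve my problem.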