
As you may know, RoBERTa (like BERT, etc.) has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings → embed, ##dings.

Due to the nature of the task I am working on, I need a single representation for each word. How do I get it?

CLARIFICATION:

sentence: "embeddings are good" --> 3 words in
output: [embed, ##dings, are, good] --> 4 tokens out

When I give the sentence to pre-trained RoBERTa, I get encoded tokens, but in the end I need a representation for each word. What's the solution? Summing the embed + ##dings vectors point-wise?
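
Here is a small sketch of what I mean, assuming the Hugging Face transformers library and the roberta-base checkpoint:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    encoding = tokenizer("embeddings are good")
    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    # something like ['<s>', 'embed', 'd', 'ings', 'Ġare', 'Ġgood', '</s>'] --
    # the exact split depends on the BPE vocabulary, but "embeddings" ends up
    # as more than one token while I need exactly one vector per word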

svyat1s

1 Answer


I'm not sure if there is a standard practice, but what I've seen others do is simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
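Here is a minimal sketch of that averaging approach, assuming the Hugging Face transformers library, the roberta-base checkpoint, and a fast tokenizer (which exposes word_ids() to map sub-tokens back to the word they came from):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")

    sentence = "embeddings are good"
    encoding = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]  # (num_tokens, hidden_size)

    # word_ids() maps every sub-token to the index of the word it came from
    # (None for special tokens like <s> and </s>)
    word_ids = encoding.word_ids(batch_index=0)
    num_words = max(i for i in word_ids if i is not None) + 1

    word_vectors = []
    for w in range(num_words):
        # average all sub-token vectors that belong to word w
        idx = [t for t, i in enumerate(word_ids) if i == w]
        word_vectors.append(hidden[idx].mean(dim=0))

    word_vectors = torch.stack(word_vectors)  # (num_words, hidden_size)
    print(word_vectors.shape)  # one vector per word, e.g. torch.Size([3, 768])

Summing instead of averaging, as you suggested, works the same way (replace .mean(dim=0) with .sum(dim=0)); averaging is just the more common choice I've seen.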

Crystina