
As you may know, RoBERTa (like BERT, etc.) has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings → embed, ##dings.

Due to the nature of the task I am working on, I need a single representation for each word. How do I get it?

CLARIFICATION:

sentence: "embeddings are good" --> 3 words in
output: [embed, ##dings, are, good] --> 4 tokens out

When I give the sentence to pre-trained RoBERTa, I get encoded tokens, but in the end I need a representation for each word. What's the solution? Summing the embed + ##dings vectors point-wise?
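
Here is a small sketch of what I mean, assuming the Hugging Face transformers library and the roberta-base checkpoint:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    encoding = tokenizer("embeddings are good")
    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    # something like ['<s>', 'embed', 'd', 'ings', 'Ġare', 'Ġgood', '</s>'] --
    # the exact split depends on the BPE vocabulary, but "embeddings" ends up
    # as more than one token while I need exactly one vector per word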

svyat1s

1 Answer


I'm not sure if there is a standard practice, but what I've seen others do is simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
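Here is a minimal sketch of that averaging approach, assuming the Hugging Face transformers library, the roberta-base checkpoint, and a fast tokenizer (which exposes word_ids() to map sub-tokens back to the word they came from):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")

    sentence = "embeddings are good"
    encoding = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]  # (num_tokens, hidden_size)

    # word_ids() maps every sub-token to the index of the word it came from
    # (None for special tokens like <s> and </s>)
    word_ids = encoding.word_ids(batch_index=0)
    num_words = max(i for i in word_ids if i is not None) + 1

    word_vectors = []
    for w in range(num_words):
        # average all sub-token vectors that belong to word w
        idx = [t for t, i in enumerate(word_ids) if i == w]
        word_vectors.append(hidden[idx].mean(dim=0))

    word_vectors = torch.stack(word_vectors)  # (num_words, hidden_size)
    print(word_vectors.shape)  # one vector per word, e.g. torch.Size([3, 768])

Summing instead of averaging, as you suggested, works the same way (replace .mean(dim=0) with .sum(dim=0)); averaging is just the more common choice I've seen.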

Crystina