While learning about tokenization, I implemented the BPE algorithm and trained it on a small corpus: the full text of Harry Potter. I noticed that my vocabulary contains tokens for "Dumbledore" and " Dumbledore" (note the leading space), as one might expect when training on this corpus. However, the token "umbledore" also ends up in the final vocabulary. I understand why this happens; it is expected behaviour of the BPE algorithm.
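For reference, here is a minimal sketch of the kind of frequency-based BPE trainer I mean (the toy corpus and merge count are chosen purely for illustration, not my actual setup); it shows how every merge adds a token to the vocabulary even when later merges absorb it:

```python
from collections import Counter

def train_bpe(corpus_words, num_merges):
    # Start from characters; every merge adds a new symbol to the vocabulary.
    word_freqs = Counter(corpus_words)
    splits = {w: tuple(w) for w in word_freqs}
    vocab = set(ch for w in word_freqs for ch in w)
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, freq in word_freqs.items():
            symbols = splits[w]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        new_token = a + b
        vocab.add(new_token)  # stays in the vocab even if a later merge supersedes it
        # Apply the merge to every word.
        for w, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[w] = tuple(merged)
    return vocab, splits

vocab, splits = train_bpe(["Dumbledore"] * 50 + ["said"] * 30, num_merges=40)
used = {tok for symbols in splits.values() for tok in symbols}
print(sorted(vocab - used))  # intermediate merge products that the final encoding never emits
```

In this toy corpus even the single characters end up unused, because both words merge into single tokens; on a realistic corpus the leftovers are intermediate merge products like the "umbledore" described above.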

However, this seems like a problem: when I now do anything with this tokenizer (e.g. training a DL model), the token "umbledore" is never actually emitted when tokenizing the corpus. This results in a "glitch token", and the parameters spent on its embedding are simply wasted.
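One way to measure the extent of this is to count how often each vocabulary entry is actually emitted when tokenizing the training text. A sketch, assuming a Hugging Face `tokenizers` BPE tokenizer saved as `hp_tokenizer.json` and the corpus in `harry_potter.txt` (both file names are placeholders):

```python
from collections import Counter
from tokenizers import Tokenizer

tok = Tokenizer.from_file("hp_tokenizer.json")   # placeholder path
used = Counter()
with open("harry_potter.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        used.update(tok.encode(line).ids)

vocab = tok.get_vocab()  # maps token string -> id
never_used = [t for t, i in vocab.items() if used[i] == 0]
print(f"{len(never_used)} of {len(vocab)} tokens never appear, e.g. {never_used[:10]}")
```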

Some questions:

  • Is this a problem specific to training a tokenizer on such a particular corpus with rare long names, or does it exist in general? Does the BPE algorithm always end up with some "intermediate tokens" / "low-frequency tokens" in its vocabulary?
  • Is there any known "fix" for this problem, or is it something we simply accept when using BPE?