Questions tagged [byte-pair-encoding]

5 questions
2
votes
0 answers

SentencePiece tokenizer encodes to unknown token

I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no Unicode characters and then try to encode a string that…
1
vote
1 answer

Some doubts about huggingface's BPE algorithm

In most BPE (Byte-Pair Encoding) tutorials, it is mentioned to add a special mark after a word. The function of this mark is to distinguish whether a subword is a prefix or a suffix of a word. We know that the input of the model is a sequence of…
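A minimal sketch of the idea the question describes, assuming the marker is the conventional end-of-word symbol `</w>` (the question's own marker was lost in formatting): appending it to the last character of each word lets merge rules tell a word-final subword such as "est</w>" apart from the same letters occurring word-internally.

```python
# Toy illustration, not HuggingFace's implementation: mark the final
# character of each word with the assumed end-of-word symbol "</w>".
def to_symbols(word):
    chars = list(word)
    chars[-1] += "</w>"  # mark the word-final character
    return chars

print(to_symbols("newest"))    # ['n', 'e', 'w', 'e', 's', 't</w>']
print(to_symbols("estimate"))  # the internal "est" carries no marker
```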
0
votes
0 answers

Byte-level BPE tokenizer for handling bigrams and trigrams

I'm currently employing the HuggingFace tokenizer to tokenize a textual database, and here's how I'm doing it: from tokenizers import ByteLevelBPETokenizer from tokenizers import normalizers tokenizer = ByteLevelBPETokenizer() tokenizer.normalizer…
Eghbal
0
votes
0 answers

Low Frequency Tokens in BPE

Learning about tokenization, I implemented the BPE algorithm and trained it on a small corpus: the full text of Harry Potter. I noticed the following: my vocabulary contains tokens for both "Dumbledore" and " Dumbledore" (notice the leading space),…
Yo.
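The duplication the asker observes is expected rather than a bug if pretokenization keeps the space attached to the following word (GPT-2-style byte-level BPE does this; the asker's exact setup is assumed here): a word at the start of a line and the same word mid-sentence are then different symbol sequences, so a frequent word can earn a merged token for both forms. A toy sketch:

```python
import re

# Sketch of space-attached pretokenization (GPT-2 style, an assumption,
# not necessarily the asker's implementation): a leading space stays
# glued to the word it precedes.
text = "Dumbledore smiled. Harry looked at Dumbledore."
pretokens = re.findall(r" ?\w+|[^\w\s]", text)

# The same word appears both with and without the leading space, so a
# BPE learner sees "Dumbledore" and " Dumbledore" as distinct units.
print(pretokens)
```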
-2
votes
2 answers

Simplifying ngram loops to compress a string given a fixed set of ngrams

Given a list of characters, list('Hello▁world▁'), and a list of character tuples, i.e. [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor',…
alvas
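One straightforward (if not maximally efficient) way to apply such a fixed merge set is to repeat a greedy left-to-right merge pass until no pair in the set remains adjacent; a sketch under that assumption, using a small illustrative merge list rather than the question's full one:

```python
def apply_merges(symbols, merges):
    """Greedily merge adjacent symbol pairs until no rule applies."""
    merges = set(merges)
    changed = True
    while changed:
        changed = False
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) in merges:
                out.append(symbols[i] + symbols[i + 1])  # apply one merge
                i += 2
                changed = True
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Small illustrative merge list (not the question's full set):
print(apply_merges(list("Hello"), [("l", "l"), ("H", "e"), ("He", "ll")]))
# -> ['Hell', 'o']
```

Note that real BPE applies merges in training-priority order rather than left-to-right position order; this sketch ignores priorities for brevity, so it may produce a different segmentation than a faithful BPE decoder.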