Questions tagged [byte-pair-encoding]
5 questions
2
votes
0 answers
SentencePiece tokenizer encodes to unknown token
I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no Unicode characters and then try to encode the string that…

Shital Shah
- 63,284
- 17
- 238
- 185
1
vote
1 answer
Some doubts about huggingface's BPE algorithm
Most BPE (Byte-Pair Encoding) tutorials say to add </w> after a word. The function of this mark is to distinguish whether a subword is a prefix or a suffix of a word.
We know that the input of the model is a sequence of…

korangar leo
- 13
- 2
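The end-of-word marker idea raised in this question can be illustrated with a small sketch. It assumes the `</w>` convention common in BPE tutorials, and the merge list is made up for illustration; it is not taken from the question or from the HuggingFace implementation. The same letters "est" end up as different symbols depending on whether they close a word.

```python
END = "</w>"  # assumed end-of-word marker, as used in many BPE tutorials


def word_to_symbols(word, end=END):
    """Split a word into characters and append the end-of-word marker."""
    return list(word) + [end]


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def apply_merges(word, merges):
    """Apply the merge rules in order to one word."""
    symbols = word_to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical merge rules, chosen only to show the effect of the marker.
merges = [("e", "s"), ("es", "t"), ("est", END)]

lowest = apply_merges("lowest", merges)      # "est" closes the word
estimate = apply_merges("estimate", merges)  # "est" opens the word

print(lowest)
print(estimate)
```

In "lowest" the final symbol is `est</w>`, while in "estimate" the leading `est` stays separate from the marker, so a model can tell the suffix from the prefix.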
0
votes
0 answers
Byte-level BPE tokenizer for handling Bigram and Trigram
I'm currently employing the HuggingFace tokenizer to tokenize a textual database, and here's how I'm doing it:
from tokenizers import ByteLevelBPETokenizer
from tokenizers import normalizers
tokenizer = ByteLevelBPETokenizer()
tokenizer.normalizer…

Eghbal
- 3,892
- 13
- 51
- 112
0
votes
0 answers
Low Frequency Tokens in BPE
Learning about tokenization, I implemented the BPE algorithm and trained it on a small corpus: the full text of Harry Potter. I noticed the following thing: my vocabulary contains tokens for "Dumbledore" and " Dumbledore" (notice the leading space),…

Yo.
- 15
- 6
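One way both "Dumbledore" and " Dumbledore" can end up in a BPE vocabulary is sketched below. This is not the asker's implementation: it assumes a GPT-2-like pre-tokenization rule in which a word keeps its leading space, and the sample text is invented. Under that assumption, a word that follows a quotation mark or starts the text has no leading space, so the two forms are distinct symbol sequences and are merged separately.

```python
import re
from collections import Counter


def pretokenize(text):
    # Assumed rule: a word keeps one leading space if it has one;
    # punctuation runs become their own pre-tokens.
    return re.findall(r" ?\w+|[^\w\s]+", text)


def merge(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; stop early if no pairs remain."""
    words = Counter(pretokenize(text))
    splits = {w: tuple(w) for w in words}  # each pre-token starts as characters
    vocab = {ch for w in words for ch in w}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in words.items():
            s = splits[w]
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab.add(best[0] + best[1])
        for w in splits:
            splits[w] = merge(splits[w], best)
    return merges, vocab


# Invented sample text; the merge budget is large enough to exhaust all pairs.
text = 'Dumbledore smiled. "Dumbledore," said Harry to Dumbledore.'
pretokens = pretokenize(text)
merges, vocab = train_bpe(text, num_merges=200)

print("Dumbledore" in vocab, " Dumbledore" in vocab)
```

On a real corpus the merge budget is far smaller than the number of possible merges, so whether each form becomes a full token depends on how often it occurs.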
-2
votes
2 answers
Simplifying ngram loops to compress the string given a fixed set of ngrams
Given a list of characters, list('Hello▁world▁'), and a list of character tuples, i.e.
[('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'),
('He', 'll'), ('worl', 'd▁'), ('wor',…

alvas
- 115,346
- 109
- 446
- 738
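The compression task in this last question can be sketched as a single loop that keeps merging adjacent symbols while any adjacent pair is in the ngram set. The tuple set below is hypothetical: it borrows a few pairs from the excerpt and adds others (such as `('o', '▁')` and `('wor', 'l')`) purely so the example runs end to end; it is not the asker's full list. The left-to-right, first-match strategy is also an assumption, and a different merge order could give a different result.

```python
def compress(chars, ngrams):
    """Repeatedly merge the leftmost adjacent pair found in `ngrams`."""
    symbols = list(chars)
    merged = True
    while merged:
        merged = False
        for i in range(len(symbols) - 1):
            if (symbols[i], symbols[i + 1]) in ngrams:
                # Replace the two symbols with their concatenation, then rescan.
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                merged = True
                break
    return symbols


# Hypothetical ngram set, for illustration only.
ngrams = {
    ("l", "l"), ("H", "e"), ("He", "ll"), ("o", "▁"), ("Hell", "o▁"),
    ("o", "r"), ("w", "or"), ("wor", "l"), ("d", "▁"), ("worl", "d▁"),
}

result = compress(list("Hello▁world▁"), ngrams)
print(result)  # ['Hello▁', 'world▁']
```

Storing the tuples in a set makes each membership check constant time, so the cost is dominated by the rescans after each merge.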