Questions tagged [byte-pair-encoding]

5 questions
2
votes
0 answers

SentencePiece tokenizer encodes to unknown token

I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no Unicode characters and then try to encode a string that…
1
vote
1 answer

Some doubts about huggingface's BPE algorithm

In most BPE (Byte-Pair Encoding) tutorials, it is mentioned to add a special mark after a word. The function of this mark is to distinguish whether a subword is a prefix or a suffix of a word. We know that the input of the model is a sequence of…
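A minimal sketch of the idea the question describes, assuming the marker is the conventional end-of-word symbol `</w>` (the question's own marker was lost in formatting): appending it to the last character of each word lets merge rules tell a word-final subword such as "est</w>" apart from the same letters occurring word-internally.

```python
# Toy illustration, not HuggingFace's implementation: mark the final
# character of each word with the assumed end-of-word symbol "</w>".
def to_symbols(word):
    chars = list(word)
    chars[-1] += "</w>"  # mark the word-final character
    return chars

print(to_symbols("newest"))    # ['n', 'e', 'w', 'e', 's', 't</w>']
print(to_symbols("estimate"))  # the internal "est" carries no marker
```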
0
votes
0 answers

Byte-level BPE tokenizer for handling bigrams and trigrams

I'm currently employing the HuggingFace tokenizer to tokenize a textual database, and here's how I'm doing it: from tokenizers import ByteLevelBPETokenizer from tokenizers import normalizers tokenizer = ByteLevelBPETokenizer() tokenizer.normalizer…
Eghbal
0
votes
0 answers

Low Frequency Tokens in BPE

Learning about tokenization, I implemented the BPE algorithm and trained it on a small corpus: the full text of Harry Potter. I noticed the following: my vocabulary contains tokens for both "Dumbledore" and " Dumbledore" (notice the leading space),…
Yo.
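The duplication the asker observes is expected rather than a bug if pretokenization keeps the space attached to the following word (GPT-2-style byte-level BPE does this; the asker's exact setup is assumed here): a word at the start of a line and the same word mid-sentence are then different symbol sequences, so a frequent word can earn a merged token for both forms. A toy sketch:

```python
import re

# Sketch of space-attached pretokenization (GPT-2 style, an assumption,
# not necessarily the asker's implementation): a leading space stays
# glued to the word it precedes.
text = "Dumbledore smiled. Harry looked at Dumbledore."
pretokens = re.findall(r" ?\w+|[^\w\s]", text)

# The same word appears both with and without the leading space, so a
# BPE learner sees "Dumbledore" and " Dumbledore" as distinct units.
print(pretokens)
```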
-2
votes
2 answers

Simplifying ngram loops to compress a string given a fixed set of ngrams

Given a list of characters, list('Hello▁world▁'), and a list of character tuples, i.e. [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor',…
alvas
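One straightforward (if not maximally efficient) way to apply such a fixed merge set is to repeat a greedy left-to-right merge pass until no pair in the set remains adjacent; a sketch under that assumption, using a small illustrative merge list rather than the question's full one:

```python
def apply_merges(symbols, merges):
    """Greedily merge adjacent symbol pairs until no rule applies."""
    merges = set(merges)
    changed = True
    while changed:
        changed = False
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) in merges:
                out.append(symbols[i] + symbols[i + 1])  # apply one merge
                i += 2
                changed = True
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Small illustrative merge list (not the question's full set):
print(apply_merges(list("Hello"), [("l", "l"), ("H", "e"), ("He", "ll")]))
# -> ['Hell', 'o']
```

Note that real BPE applies merges in training-priority order rather than left-to-right position order; this sketch ignores priorities for brevity, so it may produce a different segmentation than a faithful BPE decoder.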