from transformers import AutoTokenizer

# Load the two tokenizers to compare their sub-word splits
tokenizer1 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer1.tokenize(sequence))  # RoBERTa
print(tokenizer2.tokenize(sequence))  # BERT
Output:
['A', 'ĠTitan', 'ĠRTX', 'Ġhas', 'Ġ24', 'GB', 'Ġof', 'ĠVR', 'AM']
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
The BERT model uses the WordPiece tokenizer. Any word that does not occur in the WordPiece vocabulary is broken down into sub-words greedily. For example, 'RTX' is broken into 'R', '##T' and '##X', where '##' indicates that the piece is a continuation of the preceding token.
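To make the greedy splitting concrete, here is a minimal sketch of longest-match-first sub-word splitting against a hypothetical toy vocabulary (this is an illustration of the idea, not the actual HuggingFace implementation):

# Greedy, longest-match-first splitting of a single word (WordPiece-style).
def wordpiece_split(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match is found.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration; the real bert-base-cased vocabulary has ~29k entries.
toy_vocab = {"R", "##T", "##X", "V", "##RA", "##M"}
print(wordpiece_split("RTX", toy_vocab))   # ['R', '##T', '##X']
print(wordpiece_split("VRAM", toy_vocab))  # ['V', '##RA', '##M']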
RoBERTa uses a BPE tokenizer, but I'm unable to understand:

a) how does the BPE tokenizer work?

b) what does 'Ġ' represent in each of the tokens?