from transformers import AutoTokenizer

# Load the two tokenizers to compare their sub-word splits
tokenizer1 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer1.tokenize(sequence))  # RoBERTa
print(tokenizer2.tokenize(sequence))  # BERT
Output:
['A', 'ĠTitan', 'ĠRTX', 'Ġhas', 'Ġ24', 'GB', 'Ġof', 'ĠVR', 'AM']
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
The BERT model uses the WordPiece tokenizer. Any word that does not occur in the WordPiece vocabulary is broken down into sub-words greedily. For example, 'RTX' is broken into 'R', '##T' and '##X', where '##' indicates that the piece is a continuation of the preceding token.
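To make the greedy splitting concrete, here is a minimal sketch of longest-match-first sub-word splitting against a hypothetical toy vocabulary (this is an illustration of the idea, not the actual HuggingFace implementation):

# Greedy, longest-match-first splitting of a single word (WordPiece-style).
def wordpiece_split(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match is found.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration; the real bert-base-cased vocabulary has ~29k entries.
toy_vocab = {"R", "##T", "##X", "V", "##RA", "##M"}
print(wordpiece_split("RTX", toy_vocab))   # ['R', '##T', '##X']
print(wordpiece_split("VRAM", toy_vocab))  # ['V', '##RA', '##M']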
RoBERTa uses a BPE tokenizer, but I'm unable to understand:

a) how does the BPE tokenizer work?

b) what does 'Ġ' represent in each of the tokens?