I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenize each sentence individually and add up the counts. After some debugging I have a small reproducible example that shows the issue:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
I get the following output:
['Th', 'ames', 'Ġis', 'Ġa', 'Ġriver']
['We', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']
I would like to understand why the word Thames is split into two tokens when it appears at the start of the sequence, yet stays a single token when it appears later in the sequence. I have noticed this behaviour very frequently and, assuming it is not a bug, I would like to understand why the BART tokenizer behaves this way.
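My only guess so far is that the 'Ġ' prefix on the tokens marks a preceding space, so a word at the very start of the text is handled differently. The following variation should make that easy to check (I am assuming here that adding a leading space is a fair comparison):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
# Same word with and without a preceding space: if 'Ġ' really marks a leading
# space, these two calls should split "Thames" differently.
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize(" Thames is a river"))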