I am using Hugging Face BERT for an NLP task. My texts contain company names that get split up into subwords by the tokenizer.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
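To see how the name is being split, I can map the IDs back to word pieces (just a quick check with convert_ids_to_tokens; the exact pieces depend on the WordPiece vocabulary):

ids = tokenizer.encode_plus("Somespecialcompany")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'some', '##sp', '##ec', '##ial', '##com', '##pan', '##y', '[SEP]']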
Now, I would like to add those names to the tokenizer's vocabulary so they are not split up.
tokenizer.add_tokens("Somespecialcompany")
output: 1
This extends the tokenizer's vocabulary size from 30522 to 30523.
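A quick sanity check (assuming len(tokenizer) and convert_tokens_to_ids also reflect added tokens) suggests the new token was registered and received the next free ID:

print(len(tokenizer))                                          # 30523
print(tokenizer.convert_tokens_to_ids("Somespecialcompany"))   # should be 30522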
The desired result would therefore be for encode_plus() to use the new ID (30522) for the whole word, i.e. input_ids along the lines of [101, 30522, 102]:
tokenizer.encode_plus("Somespecialcompany")
But the output is the same as before:
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
So my question is: what is the right way of adding new tokens to the tokenizer so that they are picked up by tokenizer.encode_plus() and tokenizer.batch_encode_plus()?
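In case it matters, this is how I typically call batch_encode_plus (the example sentences are made up, and the padding/truncation arguments are just my usual settings, not specific to this problem):

batch = tokenizer.batch_encode_plus(
    ["Somespecialcompany released a new product.", "Another unrelated sentence."],
    padding=True,
    truncation=True,
)
print(batch["input_ids"])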