I am using Hugging Face BERT for an NLP task. My texts contain company names that get split up into subwords by the tokenizer.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
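To see how the name is being split, I can map the IDs back to word pieces (just a quick check with convert_ids_to_tokens; the exact pieces depend on the WordPiece vocabulary):

ids = tokenizer.encode_plus("Somespecialcompany")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'some', '##sp', '##ec', '##ial', '##com', '##pan', '##y', '[SEP]']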
Now, I would like to add those names to the tokenizer's vocabulary so they are not split up.
tokenizer.add_tokens("Somespecialcompany")
output: 1
This extends the tokenizer's vocabulary size from 30522 to 30523.
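A quick sanity check (assuming len(tokenizer) and convert_tokens_to_ids also reflect added tokens) suggests the new token was registered and received the next free ID:

print(len(tokenizer))                                          # 30523
print(tokenizer.convert_tokens_to_ids("Somespecialcompany"))   # should be 30522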
The desired result would therefore be for encode_plus() to use the new ID (30522) for the whole word, i.e. input_ids along the lines of [101, 30522, 102]:
tokenizer.encode_plus("Somespecialcompany")
But the output is the same as before:
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
So my question is: what is the right way of adding new tokens to the tokenizer so that they are picked up by tokenizer.encode_plus() and tokenizer.batch_encode_plus()?
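In case it matters, this is how I typically call batch_encode_plus (the example sentences are made up, and the padding/truncation arguments are just my usual settings, not specific to this problem):

batch = tokenizer.batch_encode_plus(
    ["Somespecialcompany released a new product.", "Another unrelated sentence."],
    padding=True,
    truncation=True,
)
print(batch["input_ids"])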