5

I am using Hugging Face BERT for an NLP task. My texts contain company names that the tokenizer splits up into subwords.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now I would like to add those names to the tokenizer's vocabulary so they are not split up.

tokenizer.add_tokens("Somespecialcompany")
output: 1

This extends the length of the tokenizer from 30522 to 30523.
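For reference, len(tokenizer) counts added tokens, so the new size can be checked directly:

len(tokenizer)
output: 30523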

The desired output would therefore contain the new ID in place of the subwords:

tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 30522, 102], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

But the output is the same as before:

output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

So my question is: what is the right way to add new tokens to the tokenizer so that I can use them with tokenizer.encode_plus() and tokenizer.batch_encode_plus()?

  • It works with the slow tokenizer (see the sketch after these comments). Please open a [bug report](https://github.com/huggingface/tokenizers/issues) for the tokenizers library. – cronoik Nov 03 '20 at 22:44
  • See also https://stackoverflow.com/questions/76198051/how-to-add-new-tokens-to-an-existing-huggingface-tokenizer/ – alvas May 08 '23 at 06:46
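For context, a minimal sketch of the slow-tokenizer behavior that comment refers to (same bert-base-uncased assumptions as above):

from transformers import BertTokenizer  # slow, pure-Python implementation

slow_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
slow_tokenizer.add_tokens("Somespecialcompany")
output: 1

slow_tokenizer.encode_plus("Somespecialcompany")['input_ids']
output: [101, 30522, 102]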

3 Answers

5

I opened a bug report on GitHub, and apparently I just have to set the special_tokens argument to True:

tokenizer.add_tokens(["somecompanyname"], special_tokens=True)
output: 1

Note that add_tokens returns the number of tokens added, not the new ID; the new ID appears in the encoding itself, as shown below.
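For completeness, a self-contained sketch of the workaround (the exact ID 30522 assumes a freshly loaded bert-base-uncased vocabulary):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# marking the token as special keeps the fast tokenizer from splitting it
tokenizer.add_tokens(["somecompanyname"], special_tokens=True)

tokenizer.encode_plus("somecompanyname")['input_ids']
output: [101, 30522, 102]

# batch_encode_plus picks the new token up as well
tokenizer.batch_encode_plus(["somecompanyname"])['input_ids']
output: [[101, 30522, 102]]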
  • I still think this is not correct. Please have a look at the [comment](https://github.com/huggingface/tokenizers/issues/507#issuecomment-722705812). – cronoik Nov 05 '20 at 23:36
  • Yeah, I saw the comment. I guess your point that the two tokenizers should behave the same way is correct. But it works with special_tokens=True the way I want, so it's fine for me. – Nui Nov 06 '20 at 11:56
0

I'm not sure you want to add it as a special token; special tokens have other behavior that would not be desirable here (e.g. they are dropped when decoding with skip_special_tokens=True). Try using the AddedToken class with single_word=True instead:

from tokenizers import AddedToken

tokenizer.add_tokens(AddedToken("somecompanyname", single_word=True))

See the AddedToken documentation: https://huggingface.co/docs/tokenizers/v0.13.3/en/api/added-tokens
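A minimal sketch of this approach (again assuming a fresh bert-base-uncased vocabulary, so the new ID is 30522):

from transformers import BertTokenizerFast
from tokenizers import AddedToken

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# single_word=True: match only as a standalone word, not inside longer words
tokenizer.add_tokens(AddedToken("somecompanyname", single_word=True))

tokenizer.encode_plus("somecompanyname")['input_ids']
output: [101, 30522, 102]

# unlike a special token, it survives decoding with skip_special_tokens=True
tokenizer.decode([101, 30522, 102], skip_special_tokens=True)
output: 'somecompanyname'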

0

Source: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/

from transformers import AutoTokenizer, AutoModel

# pick the model type
model_type = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
model = AutoModel.from_pretrained(model_type)

# new tokens
new_tokens = ["new_token"]

# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())

# add the tokens to the tokenizer vocabulary
tokenizer.add_tokens(list(new_tokens))

# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer))
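As a quick sanity check (hypothetical usage continuing the snippet above; the exact IDs and shapes are specific to roberta-base):

# the new token now encodes to a single ID just past the original vocabulary
tokenizer("new_token")["input_ids"]
output: [0, 50265, 2]  # <s>, new_token, </s>

# and the resized input embedding matrix covers it
model.get_input_embeddings().weight.shape
output: torch.Size([50266, 768])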