
I want to build a multi-class classification model that takes conversational data as input to a BERT model (using bert-base-uncased). The data looks like this:

QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?

Apart from this, I have two more inputs.

I was wondering whether I should insert special tokens into the conversation to make it more meaningful to the BERT model, like:

[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]

But I am not able to add [EOT] as a new special token.
Or should I use the [SEP] token for this?
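For context, the annotated input above could be produced from structured turns with a small helper like this (a sketch; the `build_dialogue` name and the turn format are mine, not from any library):

```python
def build_dialogue(turns, eot_token="[EOT]"):
    """Join (speaker, text) turns into one string, appending `eot_token`
    after every turn except the last; the BERT tokenizer itself adds
    [CLS] at the start and [SEP] at the end when encoding."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return f" {eot_token}\n".join(lines)

turns = [
    ("QUERY", "I want to ask a question."),
    ("ANSWER", "Sure, ask away."),
    ("QUERY", "How is the weather today?"),
    ("ANSWER", "It is nice and sunny."),
]
print(build_dialogue(turns))
```

The resulting string can then be passed to the tokenizer as shown in the steps to reproduce below.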

EDIT: steps to reproduce

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]

num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768)

tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

print(tokenizer.convert_ids_to_tokens(enc))

Result:

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it', 'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query', ':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']', 'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything', 'else', '?', '[SEP]']

sid8491

2 Answers


As the intention of the [SEP] token is to act as a separator between two sentences, it fits your objective of using [SEP] to separate sequences of QUERY and ANSWER.
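To illustrate this option, separating turns with the existing [SEP] token could look like the sketch below (the `join_with_sep` helper is an illustrative name of mine; note that with add_special_tokens=True the tokenizer still prepends [CLS] and appends a final [SEP]):

```python
def join_with_sep(turns, sep_token="[SEP]"):
    """Place the existing separator token between consecutive turns,
    instead of introducing a new [EOT] token."""
    return f" {sep_token} ".join(turns)

turns = [
    "QUERY: How is the weather today?",
    "ANSWER: It is nice and sunny.",
]
print(join_with_sep(turns))
# QUERY: How is the weather today? [SEP] ANSWER: It is nice and sunny.
```

Since [SEP] is already in the vocabulary, this variant needs no change to the tokenizer or the embedding matrix.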

You could also try adding different tokens to mark the beginning and end of each turn: <BOQ> and <EOQ> to mark the beginning and end of QUERY, and likewise <BOA> and <EOA> for ANSWER.

Sometimes, using existing tokens works much better than adding new tokens to the vocabulary, since learning a new token embedding requires a huge number of training iterations as well as a lot of data.

However, if your application demands a new token, it can be added as follows:

num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True)  # this line is updated
model.resize_token_embeddings(len(tokenizer))

# The tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
Ashwin Geet D'Sa
  • i have added [EOT] token to the tokenizer using add_tokens. then i added [EOT] in data after every turn. but while tokenizing it is breaking [EOT] as `'[', 'e', '##ot', ']',` – sid8491 Sep 15 '21 at 18:29
  • Can you please share a small reproducible snippet? – Ashwin Geet D'Sa Sep 15 '21 at 19:03
  • i have added steps in question detail. let me know if any confusion. appreciate the help. – sid8491 Sep 15 '21 at 19:39
  • Hi, I found the error. Since, `[EOT]`, was added as a special token, we had to use `special_tokens=True` as a parameter. This prevents from lowercasing the text, as after lowercasing, the added token will not be found in the vocabulary. – Ashwin Geet D'Sa Sep 15 '21 at 21:52
  • yes, it's working flawlessly now. only question remains is should I use , or , or – sid8491 Sep 16 '21 at 04:10
  • It's a little tricky.... If you have sufficient data to train the system you can go with & .... But there is no one perfect answer for what a 'sufficient amount' of data is..... So, it's more of an empirical approach, you just try both of them.... – Ashwin Geet D'Sa Sep 16 '21 at 08:55

You should add it as a special token, not as a normal token, i.e. use the "add_special_tokens" method instead of the "add_tokens" method.

Here is a code example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Before")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]


special_tokens_dict = {'additional_special_tokens': ['[EOT]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# model.resize_token_embeddings(len(tokenizer))  # uncomment if you also load a model --> Embedding(30523, 768)

tok_id = tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

print("After")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[EOT]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103, 30522]

Then, to encode the text, we use:

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  truncation=True,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

tokenizer.convert_ids_to_tokens(enc)

To get the tokens back without the special tokens:

tokenizer.convert_ids_to_tokens(enc, skip_special_tokens=True)

(To recover a plain string instead of a token list, tokenizer.decode(enc, skip_special_tokens=True) can be used.)
Wesam Na