
I'd like to turn off the warnings that Hugging Face generates when I use unique_no_split_tokens:

In[2]   tokenizer = T5Tokenizer.from_pretrained("t5-base")
In[3]   tokenizer(" ".join([f"<extra_id_{n}>" for n in range(1,101)]), return_tensors="pt").input_ids.size()
Out[3]: torch.Size([1, 100])
    Using bos_token, but it is not set yet.
    Using cls_token, but it is not set yet.
    Using mask_token, but it is not set yet.
    Using sep_token, but it is not set yet.

Anyone know how to do this?
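
For reference, a minimal sketch of one way these messages can be silenced (an assumption on my part: in the transformers versions I've looked at, the "Using bos_token, but it is not set yet." lines are only logged when the tokenizer's verbose flag is on, and they go through the library's logging utilities):

import transformers
from transformers import T5Tokenizer

# Assumption: the "Using ..., but it is not set yet." messages are gated on the
# tokenizer's `verbose` flag, so turning it off skips them entirely.
tokenizer = T5Tokenizer.from_pretrained("t5-base", verbose=False)

# Alternatively, raise the library-wide log level; this also hides other messages.
transformers.logging.set_verbosity(transformers.logging.CRITICAL)

print(tokenizer(" ".join(f"<extra_id_{n}>" for n in range(1, 101)),
                return_tensors="pt").input_ids.size())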

  • if you want to add standard special tokens, see: https://stackoverflow.com/questions/73322462/how-to-add-all-standard-special-tokens-to-my-hugging-face-tokenizer-and-model?noredirect=1&lq=1 (a sketch follows below) – Charlie Parker Aug 12 '22 at 13:06
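
In the spirit of that linked question, a hedged sketch of registering the standard special tokens with add_special_tokens (the token strings below are placeholders I chose for illustration, not values prescribed by T5; use whatever your model was trained with):

# Hypothetical example: give the tokenizer explicit bos/cls/sep/mask tokens so the
# "is not set yet" lookups have something to return. The strings are placeholders.
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "cls_token": "<cls>",
    "sep_token": "<sep>",
    "mask_token": "<mask>",
})
model.resize_token_embeddings(len(tokenizer))  # only needed if new ids were added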

1 Answer


This solution worked for me:

# register the custom tokens as special tokens so the tokenizer never splits them
tokenizer.add_tokens([f"_{n}" for n in range(1, 100)], special_tokens=True)
# grow the model's embedding matrix to cover the new vocabulary entries
model.resize_token_embeddings(len(tokenizer))
# save and reload so the extended tokenizer is used going forward
tokenizer.save_pretrained('pathToExtendedTokenizer/')
tokenizer = T5Tokenizer.from_pretrained("pathToExtendedTokenizer/")
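
A quick sanity check (a sketch, assuming the tokenizer was extended and reloaded as above): each custom token should map to a single id and never be split.

print(tokenizer.convert_tokens_to_ids("_23"))   # a single integer id
print(tokenizer("_23 _56").input_ids)           # one id per custom token, plus </s> for T5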
  • I don't get it, why are you adding random tokens labeled `_{n}`? – Charlie Parker Aug 11 '22 at 14:51
  • the default seems to be a different string for the extra ids: `tokenizer.all_special_tokens=['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', ...]` – Charlie Parker Aug 11 '22 at 14:53
  • why not `tokenizer.add_tokens([f"<extra_id_{n}>" for n in range(1, 100)], special_tokens=True)`? – Charlie Parker Aug 11 '22 at 14:57
  • why do you say this worked for you? This did not work for me. What does work for you? Can you paste the output of your script? – Charlie Parker Aug 11 '22 at 15:02
  • generating/parsing sequences with units _xy was more convenient for my application (where x, y are elements of {1, 2, ..., 9}), e.g. _23 _56 ... – Code True Aug 11 '22 at 15:02