
I am enjoying experimenting with different transformers from the excellent 'Huggingface' library. However, I receive the following error message when I attempt to use any kind of 'roberta'/'xlm' transformer. My Python code works just fine with the bert-base and bert-large models, so I want to understand how I might need to adjust it to work with these variants.

Exception: WordPiece error: Missing [UNK] token from the vocabulary

My code adds a fine-tuning layer on top of the pre-trained BERT model. All of the bert models I have used previously have had no problem tokenizing and processing the English-language text data I am analysing. My Python knowledge is growing, but I would describe it as solid at the basics and patchy above that level. Please help me to better understand the issue arising here so I can make the necessary adjustments. With thanks - Mark
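
For context, the tokenisation step in my notebook looks roughly like the sketch below. This is a minimal reconstruction rather than my exact code - the checkpoint name, variable names and max_length value are placeholders. With a bert checkpoint it runs without complaint; the error appears when I point the same code at a roberta or xlm checkpoint.

    from transformers import BertTokenizerFast

    # a BERT checkpoint ships a WordPiece vocabulary that includes [UNK],
    # so the fast tokenizer can encode with it
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    texts = ["an example sentence", "another example sentence"]

    # roughly the call shown in the traceback below
    tokens = tokenizer.batch_encode_plus(
        texts,
        max_length=25,               # placeholder value
        pad_to_max_length=True,
        truncation=True,
        return_token_type_ids=False
    )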

Here is the full error message, if that helps.

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-61-d42d72a742f6> in <module>()
      5     pad_to_max_length=True,
      6     truncation=True,
----> 7     return_token_type_ids=False
      8 )
      9 


2 frames

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode_batch(self, inputs, is_pretokenized, add_special_tokens)
    247             raise ValueError("encode_batch: `inputs` can't be `None`")
    248 
--> 249         return self._tokenizer.encode_batch(inputs, is_pretokenized, add_special_tokens)
    250 
    251     def decode(self, ids: List[int], skip_special_tokens: Optional[bool] = True) -> str:

Exception: WordPiece error: Missing [UNK] token from the vocabulary
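
From the reading I have done so far, my (possibly wrong) understanding is that roberta and xlm checkpoints do not use a WordPiece vocabulary at all, so there is no [UNK] token for a WordPiece tokeniser to fall back on, and I should be loading the tokeniser that matches the checkpoint rather than reusing the BERT one. Is the adjustment simply something like the untested sketch below, or is there more to it? (Again, the checkpoint name is just a placeholder.)

    from transformers import AutoTokenizer, AutoModel

    checkpoint = "roberta-base"   # placeholder; likewise for xlm checkpoints

    # AutoTokenizer picks the tokenizer class that matches the checkpoint
    # (WordPiece for bert-*, byte-level BPE for roberta-*, etc.)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)

    tokens = tokenizer.batch_encode_plus(
        ["an example sentence", "another example sentence"],
        max_length=25,
        pad_to_max_length=True,
        truncation=True,
        return_token_type_ids=False
    )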
  • Can you share a few lines of code? – Ashwin Geet D'Sa Jan 04 '21 at 09:14
  • Hi Ashwin, firstly I am using version 3 of transformers (!pip install -q transformers==3.0.0), as my code produced other error messages otherwise - I don't know whether this problem has been addressed in later versions. I think the error message arises in this cell (Colab not available): best_valid_loss = float('inf') train_losses=[] valid_losses=[] for epoch in range(epochs): print('\n Epoch {:} / {:}'.format(epoch + 1, epochs)) #train model train_loss, _ = fine_tune() #evaluate model valid_loss, _ = evaluate() – Mark Padley Jan 04 '21 at 09:31

0 Answers