I am enjoying experimenting with different transformers from the excellent Hugging Face library. However, I receive the following error message whenever I attempt to use any of the 'roberta'/'xlm' models. My Python code works just fine with the bert-base and bert-large models, so I want to understand how I need to adjust it to work with these variants.
Exception: WordPiece error: Missing [UNK] token from the vocabulary
My code adds a fine-tuning layer on top of the pre-trained BERT model. None of the bert models I have used previously had any problem tokenizing and processing the English-language text data I am analysing. My Python knowledge is growing, but I would describe it as solid on the basics and patchy beyond that. Please help me better understand the issue arising here so I can make the necessary adjustments. With thanks, Mark
Here is the full error message, if that helps.
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-61-d42d72a742f6> in <module>()
5 pad_to_max_length=True,
6 truncation=True,
----> 7 return_token_type_ids=False
8 )
9
2 frames
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode_batch(self, inputs, is_pretokenized, add_special_tokens)
247 raise ValueError("encode_batch: `inputs` can't be `None`")
248
--> 249 return self._tokenizer.encode_batch(inputs, is_pretokenized, add_special_tokens)
250
251 def decode(self, ids: List[int], skip_special_tokens: Optional[bool] = True) -> str:
Exception: WordPiece error: Missing [UNK] token from the vocabulary
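In case it helps, here is a trimmed-down sketch of the tokenizer call that the traceback points at. The tokenizer class, checkpoint name, variable names and max_length below are placeholders rather than my exact code; for the failing runs I point the same pipeline at a roberta/xlm checkpoint instead of a bert one.

from transformers import BertTokenizerFast

# Placeholder class and checkpoint name - the bert checkpoints tokenize my data
# without any problem; the error above appears with the roberta / xlm variants.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Placeholder for the English-language text data described above.
train_text = ["A short example sentence from my dataset."]

# These keyword arguments match the notebook cell shown in the traceback;
# max_length is a placeholder value.
tokens_train = tokenizer.batch_encode_plus(
    train_text,
    max_length=25,
    pad_to_max_length=True,
    truncation=True,
    return_token_type_ids=False,
)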