
I want to run NER on pre-tokenized text, and have the following code:

from tokenizers.pre_tokenizers import Whitespace
#from transformers import convert_slow_tokenizer
from transformers import AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
wstok = Whitespace()
#wstok = convert_slow_tokenizer.convert_slow_tokenizer(wstok)
ner_pipe = pipeline("ner", model=model, tokenizer=wstok)
tokens = ['Some', 'example', 'tokens', 'here', '.']
entities = ner_pipe(' '.join(tokens))

Which gives me the following error:

AttributeError: 'tokenizers.pre_tokenizers.Whitespace' object has no attribute 'is_fast'

Seems to me that plain and simple whitespace tokenization should be pretty "fast", but that's probably not what they mean here :).
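For what it's worth, I do see that `Whitespace` lives in `tokenizers.pre_tokenizers`, so on its own it only splits a string into word spans rather than producing input IDs. A minimal check of what it actually returns (just the splitting step, no vocabulary involved):

```python
from tokenizers.pre_tokenizers import Whitespace

# A pre-tokenizer only splits text into (word, offset) pairs;
# it does not map words to IDs, which is why it is not a full tokenizer.
pieces = Whitespace().pre_tokenize_str("Some example tokens here .")
words = [word for word, offsets in pieces]
print(words)  # ['Some', 'example', 'tokens', 'here', '.']
```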

I've seen this post (hence the commented out lines in the code snippet), but that tells me that the Whitespace class is not among the ones that can be converted.

Does anyone have ideas on how I can get a fast whitespace tokenizer in Hugging Face?
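The closest I can sketch (untested beyond the snippet below) is building a full `tokenizers.Tokenizer` that uses `Whitespace` as its pre-tokenizer and wrapping it in `PreTrainedTokenizerFast`, which does have `is_fast`. Note the `WordLevel` vocabulary here is a toy one trained on a single sentence, so it would not match the WordPiece vocabulary that dslim/bert-base-NER expects:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Build a full tokenizer: whitespace splitting + a word-level vocabulary.
tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Toy vocabulary trained on one sentence, just to make the object usable.
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tok.train_from_iterator(["Some example tokens here ."], trainer)

# Wrap it so transformers sees a proper fast tokenizer.
fast_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]")
print(fast_tok.is_fast)  # True
```

This at least gets past the `is_fast` error, but of course the pipeline would still need a tokenizer whose vocabulary matches the model's.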

Igor
  • Can you try adding `is_fast` manually to the tokenizer (like `wstok.is_fast = False`)? It may be related to a bug in Hugging Face or Keras. – stuck Mar 23 '22 at 13:31
  • `wstok.is_fast = False` results in "TypeError: 'tokenizers.pre_tokenizers.Whitespace' object is not callable" – Igor Mar 23 '22 at 13:37

0 Answers