
I have a trained model saved to disk, along with a slow tokenizer:

from transformers import convert_slow_tokenizer
from transformers import BertTokenizer, BertForSequenceClassification

mybert = BertForSequenceClassification.from_pretrained(PATH,
                                                        local_files_only=True,
                                                        )
tokenizer = BertTokenizer.from_pretrained(PATH, 
                                          local_files_only=True, 
                                          use_fast=True)

I am able to use it to tokenize like so:

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50
)

However, it is not a fast tokenizer:

tokenized_example.is_fast
False

I try to convert it to a fast one, which appears to succeed:

tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)

However, now running this gives me:

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50
)


TypeError: 'tokenizers.Tokenizer' object is not callable

How can I convert this slow tokenizer to a fast one?

I have seen this answer, and I have sentencepiece installed; that did not fix my issue.
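
For reference, inspecting the converted object confirms it is a raw tokenizers.Tokenizer rather than a transformers tokenizer, consistent with the TypeError above:

# After convert_slow_tokenizer, `tokenizer` is no longer a
# transformers tokenizer, so it has no __call__ method.
print(type(tokenizer))  # <class 'tokenizers.Tokenizer'>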


1 Answer


@Mittenchops, the reason tokenized_example.is_fast is False is that BertTokenizer is always the slow, pure-Python implementation; the use_fast argument only has an effect with AutoTokenizer. Rather than converting the slow tokenizer, you can use Hugging Face's fast tokenizer class, BertTokenizerFast, directly.

Your code would look something like this:

from transformers import BertTokenizerFast, BertForSequenceClassification

mybert = BertForSequenceClassification.from_pretrained(PATH,
                                                       local_files_only=True)

# BertTokenizerFast is backed by the Rust `tokenizers` library, so no
# conversion step is needed (and use_fast is unnecessary here).
tokenizer = BertTokenizerFast.from_pretrained(PATH,
                                              local_files_only=True)

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50,
)

# In this case: tokenized_example.is_fast will yield True
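
If you specifically want to convert an already-loaded slow tokenizer, note that convert_slow_tokenizer returns a raw tokenizers.Tokenizer, which is not callable the way a transformers tokenizer is (hence the TypeError in the question). A minimal sketch of wrapping it back into a usable object, assuming a transformers version recent enough to support the tokenizer_object argument:

from transformers import BertTokenizer, PreTrainedTokenizerFast
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow_tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

# The converted object is a raw tokenizers.Tokenizer; wrapping it in
# PreTrainedTokenizerFast restores the usual __call__ interface.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=convert_slow_tokenizer(slow_tokenizer))

fast_tokenizer.is_fast  # True

One caveat: a tokenizer built this way does not automatically inherit special-token attributes (e.g. pad_token), so you may need to pass those to PreTrainedTokenizerFast yourself; the BertTokenizerFast route above handles that for you.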