
I have a trained model saved to disk, along with a slow tokenizer:

from transformers import convert_slow_tokenizer
from transformers import BertTokenizer, BertForSequenceClassification

mybert = BertForSequenceClassification.from_pretrained(PATH,
                                                        local_files_only=True,
                                                        )
tokenizer = BertTokenizer.from_pretrained(PATH, 
                                          local_files_only=True, 
                                          use_fast=True)

I am able to use it to tokenize like so:

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50
)

However, it is not a fast tokenizer:

tokenized_example.is_fast
False

I try to convert it to a fast one, which appears to succeed:

tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)

However, now running this gives me:

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50
)


TypeError: 'tokenizers.Tokenizer' object is not callable

How can I convert this slow tokenizer to a fast one?

I have seen this answer, and I have sentencepiece installed; that did not fix my issue.
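
For reference, inspecting the converted object confirms it is a raw tokenizers.Tokenizer rather than a transformers tokenizer, consistent with the TypeError above:

# After convert_slow_tokenizer, `tokenizer` is no longer a
# transformers tokenizer, so it has no __call__ method.
print(type(tokenizer))  # <class 'tokenizers.Tokenizer'>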


1 Answer


@Mittenchops, the reason tokenized_example.is_fast is False is that BertTokenizer is always the slow, pure-Python implementation; the use_fast argument only has an effect with AutoTokenizer. Rather than converting the slow tokenizer, you can use Hugging Face's fast tokenizer class, BertTokenizerFast, directly.

Your code would look something like this:

from transformers import BertTokenizerFast, BertForSequenceClassification

mybert = BertForSequenceClassification.from_pretrained(PATH,
                                                       local_files_only=True)

# BertTokenizerFast is backed by the Rust `tokenizers` library, so no
# conversion step is needed (and use_fast is unnecessary here).
tokenizer = BertTokenizerFast.from_pretrained(PATH,
                                              local_files_only=True)

tokenized_example = tokenizer(
    mytext,
    max_length=100,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=50,
)

# In this case: tokenized_example.is_fast will yield True
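
If you specifically want to convert an already-loaded slow tokenizer, note that convert_slow_tokenizer returns a raw tokenizers.Tokenizer, which is not callable the way a transformers tokenizer is (hence the TypeError in the question). A minimal sketch of wrapping it back into a usable object, assuming a transformers version recent enough to support the tokenizer_object argument:

from transformers import BertTokenizer, PreTrainedTokenizerFast
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow_tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

# The converted object is a raw tokenizers.Tokenizer; wrapping it in
# PreTrainedTokenizerFast restores the usual __call__ interface.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=convert_slow_tokenizer(slow_tokenizer))

fast_tokenizer.is_fast  # True

One caveat: a tokenizer built this way does not automatically inherit special-token attributes (e.g. pad_token), so you may need to pass those to PreTrainedTokenizerFast yourself; the BertTokenizerFast route above handles that for you.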