
I am trying to save a Hugging Face tokenizer so that I can later load it from a container that has no internet access.

from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

However, the last line is giving the error:

OSError: Can't load config for './models/tokenizer3/'. Make sure that:

- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer3/' is the correct path to a directory containing a config.json file

transformers version: 3.1.0

The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.

Edit 1

Thanks to @ashwin's answer below I tried save_pretrained instead, and I get the following error:

OSError: Can't load config for './models/tokenizer/'. Make sure that:

- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer/' is the correct path to a directory containing a config.json file

The contents of the tokenizer folder are shown below: (screenshot of the folder contents omitted)

I tried renaming `tokenizer_config.json` to `config.json` and then I got the error:

ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
sachinruk

3 Answers


save_vocabulary() saves only the tokenizer's vocabulary file (the list of tokens), not its configuration.

To save the entire tokenizer, use save_pretrained() instead:

from transformers import AutoTokenizer, DistilBertTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

Edit:

For some unknown reason, loading with

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

fails, but using the model-specific class

tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

works.

Ashwin Geet D'Sa
  • Tried looking into it. It seems like a bug. And as you have figured out it saves tokenizer_config.json and expects config.json. – Ashwin Geet D'Sa Oct 28 '20 at 09:15
  • 1
    As a workaround, since you are not modifying the tokenizer, you can load the model using `from_pretrained`, then save the model. You can also load the tokenizer from the saved model. This should be a tentative workaround. – Ashwin Geet D'Sa Oct 28 '20 at 09:21
  • Please check out the modification. – Ashwin Geet D'Sa Oct 28 '20 at 09:28
  • 1
    @sachinruk: Just in case you have to work with the AutoTokenizers, you have to save the corresponding config as shown [here](https://stackoverflow.com/questions/62472238/autotokenizer-from-pretrained-fails-to-load-locally-saved-pretrained-tokenizer/62664374#62664374). – cronoik Oct 28 '20 at 21:34
  • @cronoik, I checked your answer in the other post. However, I was curious to know if there is any raised issue on github? I could not find any issue concerning this problem. – Ashwin Geet D'Sa Oct 28 '20 at 22:56
  • @AshwinGeetD'Sa Yes, there is. I have linked to it in the first sentence :) But the [issue](https://github.com/huggingface/transformers/issues/4197) was closed. I will reopen it tomorrow and provide a patch. – cronoik Oct 28 '20 at 23:00
  • That's awesome :) – Ashwin Geet D'Sa Oct 28 '20 at 23:13
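To make the comments above concrete: `AutoTokenizer` needs a `config.json` in the directory to infer which tokenizer class to instantiate, while `DistilBertTokenizer` already knows its own class, which is why the explicit class works. Based on the `ValueError` quoted in the question (a `model_type` key is enough), a minimal sketch of the workaround using only the standard library — the paths and the `model_type` value are assumptions for this DistilBERT setup:

```python
import json
from pathlib import Path

# Directory where save_pretrained() put the tokenizer files (assumed path).
tokenizer_dir = Path("./models/tokenizer/")
tokenizer_dir.mkdir(parents=True, exist_ok=True)

# Write a minimal config.json: per the ValueError above, AutoTokenizer only
# needs a "model_type" key to pick the right tokenizer class.
config = {"model_type": "distilbert"}
(tokenizer_dir / "config.json").write_text(json.dumps(config))

# After this, AutoTokenizer.from_pretrained("./models/tokenizer/") should be
# able to resolve the class locally, without network access.
print((tokenizer_dir / "config.json").read_text())
```

This keeps `tokenizer_config.json` intact instead of renaming it, so the tokenizer's own settings are not lost.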

Renaming the "tokenizer_config.json" file (the one created by the save_pretrained() function) to "config.json" solved the same issue in my environment.


You need to save both your model and tokenizer in the same directory. Hugging Face actually looks for your model's config.json file, so renaming tokenizer_config.json would not solve the issue.
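Before shipping that directory into the offline container, it can help to sanity-check that the files `from_pretrained` looks for are actually present. A minimal sketch using only the standard library — the required-file list is an assumption for a local DistilBERT checkpoint on transformers 3.x, so adjust it to your model:

```python
from pathlib import Path

# Assumed minimum for loading a DistilBERT model + tokenizer locally:
# the model config and the tokenizer vocabulary.
REQUIRED = ["config.json", "vocab.txt"]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    d = Path(model_dir)
    return [name for name in REQUIRED if not (d / name).exists()]

# Demo on a throwaway directory that contains only a vocab file,
# reproducing the situation in the question:
demo = Path("./models/offline_check_demo")
demo.mkdir(parents=True, exist_ok=True)
(demo / "vocab.txt").write_text("[PAD]\n[UNK]\n")
print(missing_files(demo))  # config.json is reported as missing
```

Running this before building the container image turns the opaque `OSError` at load time into an explicit list of missing files.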

  • tokenizer.save_pretrained("/home/pchhapolika/Bert_multilingual_exp_TCM/model_mlm_exp1") produces 4 files when I add new tokens. Ideally, when I save the tokenizer it should produce only one tokenizer.json file? – MAC Mar 11 '22 at 18:45