
I am trying to save a Hugging Face tokenizer so that I can later load it from a container that has no internet access.

from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

However, the last line is giving the error:

OSError: Can't load config for './models/tokenizer3/'. Make sure that:

- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer3/' is the correct path to a directory containing a config.json file

transformers version: 3.1.0

The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.

Edit 1

Thanks to @ashwin's answer below I tried save_pretrained instead, and I get the following error:

OSError: Can't load config for './models/tokenizer/'. Make sure that:

- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer/' is the correct path to a directory containing a config.json file

The contents of the tokenizer folder are shown below: (screenshot of the folder contents omitted)

I tried renaming `tokenizer_config.json` to `config.json` and then I got the error:

ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
sachinruk

3 Answers


save_vocabulary() saves only the tokenizer's vocabulary file (the list of tokens), not its configuration.

To save the entire tokenizer, use save_pretrained() instead:

from transformers import AutoTokenizer, DistilBertTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

Edit:

For some unknown reason, loading with

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

fails, but using the model-specific class

tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

works.

Ashwin Geet D'Sa
  • Tried looking into it. It seems like a bug. And as you have figured out it saves tokenizer_config.json and expects config.json. – Ashwin Geet D'Sa Oct 28 '20 at 09:15
  • 1
    As a workaround, since you are not modifying the tokenizer, you can load the model using `from_pretrained`, then save the model. You can also load the tokenizer from the saved model. This should be a tentative workaround. – Ashwin Geet D'Sa Oct 28 '20 at 09:21
  • Please check out the modification. – Ashwin Geet D'Sa Oct 28 '20 at 09:28
  • 1
    @sachinruk: Just in case you have to work with the AutoTokenizers, you have to save the corresponding config as shown [here](https://stackoverflow.com/questions/62472238/autotokenizer-from-pretrained-fails-to-load-locally-saved-pretrained-tokenizer/62664374#62664374). – cronoik Oct 28 '20 at 21:34
  • @cronoik, I checked your answer in the other post. However, I was curious to know if there is any raised issue on github? I could not find any issue concerning this problem. – Ashwin Geet D'Sa Oct 28 '20 at 22:56
  • @AshwinGeetD'Sa Yes, there is. I have linked to it in the first sentence :) But the [issue](https://github.com/huggingface/transformers/issues/4197) was closed. I will reopen it tomorrow and provide a patch. – cronoik Oct 28 '20 at 23:00
  • That's awesome :) – Ashwin Geet D'Sa Oct 28 '20 at 23:13
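To make the comments above concrete: `AutoTokenizer` needs a `config.json` in the directory to infer which tokenizer class to instantiate, while `DistilBertTokenizer` already knows its own class, which is why the explicit class works. Based on the `ValueError` quoted in the question (a `model_type` key is enough), a minimal sketch of the workaround using only the standard library — the paths and the `model_type` value are assumptions for this DistilBERT setup:

```python
import json
from pathlib import Path

# Directory where save_pretrained() put the tokenizer files (assumed path).
tokenizer_dir = Path("./models/tokenizer/")
tokenizer_dir.mkdir(parents=True, exist_ok=True)

# Write a minimal config.json: per the ValueError above, AutoTokenizer only
# needs a "model_type" key to pick the right tokenizer class.
config = {"model_type": "distilbert"}
(tokenizer_dir / "config.json").write_text(json.dumps(config))

# After this, AutoTokenizer.from_pretrained("./models/tokenizer/") should be
# able to resolve the class locally, without network access.
print((tokenizer_dir / "config.json").read_text())
```

This keeps `tokenizer_config.json` intact instead of renaming it, so the tokenizer's own settings are not lost.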

Renaming the "tokenizer_config.json" file (the one created by the save_pretrained() function) to "config.json" solved the same issue in my environment.


You need to save both your model and tokenizer in the same directory. Hugging Face actually looks for your model's config.json file, so renaming tokenizer_config.json would not solve the issue.
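Before shipping that directory into the offline container, it can help to sanity-check that the files `from_pretrained` looks for are actually present. A minimal sketch using only the standard library — the required-file list is an assumption for a local DistilBERT checkpoint on transformers 3.x, so adjust it to your model:

```python
from pathlib import Path

# Assumed minimum for loading a DistilBERT model + tokenizer locally:
# the model config and the tokenizer vocabulary.
REQUIRED = ["config.json", "vocab.txt"]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    d = Path(model_dir)
    return [name for name in REQUIRED if not (d / name).exists()]

# Demo on a throwaway directory that contains only a vocab file,
# reproducing the situation in the question:
demo = Path("./models/offline_check_demo")
demo.mkdir(parents=True, exist_ok=True)
(demo / "vocab.txt").write_text("[PAD]\n[UNK]\n")
print(missing_files(demo))  # config.json is reported as missing
```

Running this before building the container image turns the opaque `OSError` at load time into an explicit list of missing files.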

  • tokenizer.save_pretrained("/home/pchhapolika/Bert_multilingual_exp_TCM/model_mlm_exp1") produces 4 files when I add new tokens. Ideally, when I save the tokenizer it should produce only one tokenizer.json file? – MAC Mar 11 '22 at 18:45