
From the documentation for from_pretrained, I understand that I don't have to download the pretrained vectors every time; I can save them and load from disk with this syntax:

  - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
  - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
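
For concreteness, here is a minimal sketch of that save/load round trip (bert-base-cased is used purely as an example name):

  from transformers import BertTokenizer

  # Download once from the hub, then persist locally.
  tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
  tokenizer.save_pretrained("./my_model_directory/")

  # Later: load from disk, no network access needed.
  tokenizer = BertTokenizer.from_pretrained("./my_model_directory/")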

So, I went to the model hub:

I found the model I wanted:

I downloaded it from the link they provided to this repository:

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English.

Stored it in:

  /my/local/models/cased_L-12_H-768_A-12/

Which contains:

 ./
 ../
 bert_config.json
 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

So, now I have the following:

  from transformers import BertTokenizer

  PATH = '/my/local/models/cased_L-12_H-768_A-12/'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

And I get this error:

>           raise EnvironmentError(msg)
E           OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E           
E           - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E           
E           - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file

I get a similar failure when I point to the config.json directly:

  PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

        if state_dict is None and not from_tf:
            try:
                state_dict = torch.load(resolved_archive_file, map_location="cpu")
            except Exception:
                raise OSError(
>                   "Unable to load weights from pytorch checkpoint file. "
                    "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
                )
E               OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

What should I do differently to get huggingface to use my local pretrained model?

Update to address the comments

from transformers import TransfoXLModel, TransfoXLTokenizerFast

YOURPATH = '/somewhere/on/disk/'

name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)

>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')

So everything is saved, but then...

YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)

    "Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
Mittenchops
  • Not sure where you got these files from. When I check the link, I can download the following files: `config.json`, `flax_model.msgpack`, `modelcard.json`, `pytorch_model.bin`, `tf_model.h5`, `vocab.txt`. Also, it is better to save the files via `tokenizer.save_pretrained('YOURPATH')` and `model.save_pretrained('YOURPATH')` instead of downloading them directly. – cronoik Oct 04 '20 at 21:59
  • Thank you. I have updated the question to reflect that I tried this and it did not seem to work. – Mittenchops Oct 05 '20 at 18:38
  • Please use `TransfoXLTokenizerFast.from_pretrained(YOURPATH)`. – cronoik Oct 06 '20 at 04:08
  • @Mittenchops did you ever solve this? I'm having similar difficulty loading a model from disk. – Evan Zamir Mar 04 '21 at 20:19
  • I had the same issue when I used a relative path (i.e. ```./data/bert-large-uncased/```), but when I went to absolute path (i.e. ```/opt/workspace/data/bert-large-uncased/```) it miraculously worked – Catsbergers Jan 27 '22 at 19:53
  • For what it's worth, I used pathlib to overcome this. I could load the model from the file itself, but when it was used in a class as part of a larger app in a Docker container, I had to use the full path to instantiate everything. – 1extralime Jul 18 '22 at 19:08
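
Picking up cronoik's suggestion above, the fix amounts to pointing from_pretrained at the save directory itself, rather than passing it as cache_dir (a minimal sketch reusing the paths from the update; note the library's own warning above about loading this vocabulary with the Rust-based fast tokenizer):

from transformers import TransfoXLTokenizerFast

YOURPATH = '/somewhere/on/disk/'
tokenizer = TransfoXLTokenizerFast.from_pretrained(YOURPATH, local_files_only=True)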

6 Answers


Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file where you are writing the code is located in 'my/local/', then your code should look like this:

PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

You just need to specify the folder where all the files are, not the files themselves. I think this is definitely a problem with the PATH. Try changing the style of slashes ("/" vs "\"), since these differ between operating systems. Also try a leading ".", as in ./models/cased_L-12_H-768_A-12/.
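
One way to sidestep the slash-style question entirely is pathlib, which picks the right separator for the current OS (a sketch, assuming the same directory layout as above):

from pathlib import Path
from transformers import BertTokenizer

# pathlib normalizes path separators per operating system.
PATH = Path('models') / 'cased_L-12_H-768_A-12'
tokenizer = BertTokenizer.from_pretrained(str(PATH), local_files_only=True)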

Sameer Zahid
  • Sorry, this actually was an absolute path, just mangled when I changed it for an example. I updated the question. – Mittenchops Sep 23 '20 at 13:26

I had this same need and just got this working with Tensorflow on my Linux box so figured I'd share.

My requirements.txt file for my code environment:

tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3

I'm using Python 3.6.

I went to this site here which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.

I manually downloaded the following files (in some cases I had to copy/paste the contents into Notepad++, because the download button took me to a raw view of the txt/json... odd):

  • config.json
  • tf_model.h5
  • tokenizer_config.json
  • tokenizer.json
  • vocab.txt

NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.

I then put those files in this directory on my Linux box:

/opt/word_embeddings/bert-base-uncased/

Probably a good idea to make sure there's at least read permissions on all of these files as well with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.

From there, I'm able to load the model like so:

tokenizer:

# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")

layer/model weights:

# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
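
From here, a quick smoke test confirms the two pieces load and work together (illustrative only):

# Tokenize a sentence and run it through the model.
inputs = tokenizer("Hello, world!", return_tensors="tf")
outputs = bert(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
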
TaylorV

This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load it:

from transformers import AutoModel

model = AutoModel.from_pretrained('.\model', local_files_only=True)

Please note the dot in '.\model'; omitting it will make the load fail.
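
If the backslash worries you ('\m' happens not to be a recognized Python escape, so the literal above works, but other letters are), a raw string is a safe variant of the same call:

# Raw string avoids any backslash-escape surprises on Windows.
model = AutoModel.from_pretrained(r'.\model', local_files_only=True)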

Sheraz
  • This worked for me. Of course relative paths have worked on any OS since long before I was born (and I'm really old), but +1 because the code works. – Jelmer Wind Dec 07 '22 at 12:58

In addition to the config file and the vocab file, you need to add the TF/Torch model (which has a .h5/.bin extension) to your directory.

In your case, the Torch and TF models may be located at these URLs:

torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin

tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5

You can also find all the required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
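
Rather than fetching files one by one, one option (assuming the huggingface_hub package is installed) is to pull the whole repository in a single call:

from huggingface_hub import snapshot_download

# Downloads every file in the repo and returns the local directory path.
local_dir = snapshot_download(repo_id="bert-base-cased")
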

Milad Ce

The bert model folder contains these files:

config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt

But what if, instead of these, the folder contains the following:

 bert_config.json
 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

How do we load the model then?
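
For what it's worth, those are the original Google-research TF 1.x checkpoint files, not the Hugging Face layout, so they need a one-time conversion. A sketch of that conversion, mirroring what the library's own conversion script does (TensorFlow must be installed to read the checkpoint):

from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Build a PyTorch model from the original config, then copy in the TF weights.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "bert_model.ckpt")

# Save in the Hugging Face layout (config.json + pytorch_model.bin);
# keep vocab.txt alongside for the tokenizer.
model.save_pretrained("./converted_bert/")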

pooja

Here is a short answer.

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin', config='../config.json', local_files_only=True)

Usually config.json need not be supplied explicitly if it resides in the same directory.
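
And a quick way to check the load actually worked (an illustrative fill-in-the-blank, not part of the original answer):

import torch

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the [MASK] position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))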

CKM