I have been trying to use the `csebuetnlp/mT5_multilingual_XLSum` model for summarization. The code I tried is listed below:
```python
!pip install transformers
!pip install sentencepiece
import transformers
text_example = """
En düşük emekli aylığının 5 bin 500 liradan 7 bin 500 liraya yükseltilmesi için TBMM'de yasal düzenleme yapılacak; ardından zamlı aylıkların nisan ayında hesaplara aktarılması planlanıyor.
AKP'li Cumhurbaşkanı Recep Tayyip Erdoğan'ın dün katıldığı televizyon programında en düşük emekli aylığının 7 bin 500 liraya yükseltildiği yönündeki açıklaması, emekliler tarafından memnuniyetle karşılandı.
Bu müjdenin ardından gözler, söz konusu kararın uygulanması için TBMM'de yapılacak yasal düzenlemeye çevrildi.
En düşük emekli aylığının 7 bin 500 liraya yükseltilmesi yönündeki kararın ilerleyen günlerde yasalaşması ve zamlı aylıkların nisan ayında hesaplara yatırılması bekleniyor.
Söz konusu artıştan, EYT düzenlemesiyle emekli olanlarla beraber emeklilerin yarısından fazlası yararlanacak.
"""
from transformers import pipeline
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
summarizer(text_example)
```
The output I got is listed below:
```
Requirement already satisfied: transformers in d:\anaaac\lib\site-packages (4.24.0)
Requirement already satisfied: regex!=2019.12.17 in d:\anaaac\lib\site-packages (from transformers) (2023.3.23)
Requirement already satisfied: pyyaml>=5.1 in d:\anaaac\lib\site-packages (from transformers) (6.0)
Requirement already satisfied: requests in d:\anaaac\lib\site-packages (from transformers) (2.28.1)
Requirement already satisfied: tqdm>=4.27 in d:\anaaac\lib\site-packages (from transformers) (4.65.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in d:\anaaac\lib\site-packages (from transformers) (0.11.0)
Requirement already satisfied: packaging>=20.0 in d:\anaaac\lib\site-packages (from transformers) (23.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in d:\anaaac\lib\site-packages (from transformers) (0.13.2)
Requirement already satisfied: filelock in d:\anaaac\lib\site-packages (from transformers) (3.9.0)
Requirement already satisfied: numpy>=1.17 in d:\anaaac\lib\site-packages (from transformers) (1.24.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in d:\anaaac\lib\site-packages (from huggingface-hub<1.0,>=0.10.0->transformers) (4.4.0)
Requirement already satisfied: colorama in d:\anaaac\lib\site-packages (from tqdm>=4.27->transformers) (0.4.6)
Requirement already satisfied: certifi>=2017.4.17 in d:\anaaac\lib\site-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: charset-normalizer<3,>=2 in d:\anaaac\lib\site-packages (from requests->transformers) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in d:\anaaac\lib\site-packages (from requests->transformers) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in d:\anaaac\lib\site-packages (from requests->transformers) (1.26.14)
Requirement already satisfied: sentencepiece in d:\anaaac\lib\site-packages (0.1.97)
```

```
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1932, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1931 try:
-> 1932 tokenizer = cls(*init_inputs, **init_kwargs)
1933 except OSError:
File D:\anaaac\lib\site-packages\transformers\models\t5\tokenization_t5.py:155, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
154 self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 155 self.sp_model.Load(vocab_file)
File D:\anaaac\lib\site-packages\sentencepiece\__init__.py:905, in SentencePieceProcessor.Load(self, model_file, model_proto)
904 return self.LoadFromSerializedProto(model_proto)
--> 905 return self.LoadFromFile(model_file)
File D:\anaaac\lib\site-packages\sentencepiece\__init__.py:310, in SentencePieceProcessor.LoadFromFile(self, arg)
309 def LoadFromFile(self, arg):
--> 310 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "C:\Users\ist/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\spiece.model": No such file or directory Error #2
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
Cell In[4], line 17
4 text_example = """
5 En düşük emekli aylığının 5 bin 500 liradan 7 bin 500 liraya yükseltilmesi için TBMM'de yasal düzenleme yapılacak; ardından zamlı aylıkların nisan ayında hesaplara aktarılması planlanıyor.
6
(...)
13 Söz konusu artıştan, EYT düzenlemesiyle emekli olanlarla beraber emeklilerin yarısından fazlası yararlanacak.
14 """
16 from transformers import pipeline
---> 17 summarizer = pipeline("summarization", model= "csebuetnlp/mT5_multilingual_XLSum")
18 summarizer(text_example)
File D:\anaaac\lib\site-packages\transformers\pipelines\__init__.py:801, in pipeline(task, model, config, tokenizer, feature_extractor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
798 tokenizer_identifier = tokenizer
799 tokenizer_kwargs = model_kwargs
--> 801 tokenizer = AutoTokenizer.from_pretrained(
802 tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
803 )
805 if load_feature_extractor:
806 # Try to infer feature extractor from model or config name (if provided as str)
807 if feature_extractor is None:
File D:\anaaac\lib\site-packages\transformers\models\auto\tokenization_auto.py:619, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
615 if tokenizer_class is None:
616 raise ValueError(
617 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
618 )
--> 619 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
621 # Otherwise we have to be creative.
622 # if model is an encoder decoder, the encoder tokenizer class is used by default
623 if isinstance(config, EncoderDecoderConfig):
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1777, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1774 else:
1775 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1777 return cls._from_pretrained(
1778 resolved_vocab_files,
1779 pretrained_model_name_or_path,
1780 init_configuration,
1781 *init_inputs,
1782 use_auth_token=use_auth_token,
1783 cache_dir=cache_dir,
1784 local_files_only=local_files_only,
1785 _commit_hash=commit_hash,
1786 **kwargs,
1787 )
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1807, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1805 has_tokenizer_file = resolved_vocab_files.get("tokenizer_file", None) is not None
1806 if (from_slow or not has_tokenizer_file) and cls.slow_tokenizer_class is not None:
-> 1807 slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
1808 copy.deepcopy(resolved_vocab_files),
1809 pretrained_model_name_or_path,
1810 copy.deepcopy(init_configuration),
1811 *init_inputs,
1812 use_auth_token=use_auth_token,
1813 cache_dir=cache_dir,
1814 local_files_only=local_files_only,
1815 _commit_hash=_commit_hash,
1816 **(copy.deepcopy(kwargs)),
1817 )
1818 else:
1819 slow_tokenizer = None
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1934, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1932 tokenizer = cls(*init_inputs, **init_kwargs)
1933 except OSError:
-> 1934 raise OSError(
1935 "Unable to load vocabulary from file. "
1936 "Please check that the provided vocabulary is accessible and not corrupted."
1937 )
1939 # Save inputs and kwargs for saving and re-loading with ``save_pretrained``
1940 # Removed: Now done at the base class level
1941 # tokenizer.init_inputs = init_inputs
1942 # tokenizer.init_kwargs = init_kwargs
1943
1944 # If there is a complementary special token map, load it
1945 special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
```
(For context, the Turkish input text is about a raise in the minimum retirement pension in Turkey.)
This part of the output is really strange, because the `spiece.model` file exists in exactly that directory:

```
OSError: Not found: "C:\Users\ist/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\spiece.model": No such file or directory Error #2
```
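To double-check that, here is a minimal sketch of what I mean (the path is copied verbatim from the traceback, including the mixed slash; the snapshot hash is specific to my cached download). As far as I understand, the Hugging Face cache usually stores snapshot files as symlinks into a `blobs` folder on systems where symlinks are available, and `os.path.exists()` follows symlinks, so this check would also hint at whether a link is broken:

```python
import os

# Path copied verbatim from the traceback above; the snapshot hash
# (2437a5...) is specific to my machine's cached download.
p = (r"C:\Users\ist/.cache\huggingface\hub"
     r"\models--csebuetnlp--mT5_multilingual_XLSum"
     r"\snapshots\2437a524effdbadc327ced84595508f1e32025b3"
     r"\spiece.model")

print(os.path.exists(p))   # False would mean Python cannot resolve the file (e.g. a broken link)
print(os.path.islink(p))   # True would mean the snapshot entry is a symlink into the blobs dir
```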
Version info:
Server Information:
You are using Jupyter Notebook.
The version of the notebook server is: **6.5.3**
The server is running on this version of Python:
`Python 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]`
- I tried upgrading `transformers` and `sentencepiece`; it didn't help.
- I tried the models `sshleifer/distilbart-cnn-12-6` and `flax-community/t5-base-cnn-dm`. They both worked as expected, but I need a multilingual model.
- I tried running the same code in Google Colaboratory; it worked as expected, and the output is:
```
Downloading (…)lve/main/config.json: 100%
730/730 [00:00<00:00, 14.0kB/s]
Downloading pytorch_model.bin: 100%
2.33G/2.33G [00:23<00:00, 109MB/s]
Downloading (…)okenizer_config.json: 100%
375/375 [00:00<00:00, 9.51kB/s]
Downloading spiece.model: 100%
4.31M/4.31M [00:00<00:00, 14.0MB/s]
Downloading (…)cial_tokens_map.json: 100%
65.0/65.0 [00:00<00:00, 2.70kB/s]
/usr/local/lib/python3.9/dist-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
[{'summary_text': "Cumhurbaşkanı Recep Tayyip Erdoğan'ın dün açıkladığı en düşük emekli aylığının 7 bin 500 liraya yükseltilmesi yönündeki açıklaması, emekliler tarafından memnuniyetle karşılandı."}]
```
This output is not weird, and the summary (roughly: "President Recep Tayyip Erdoğan's announcement yesterday that the lowest retirement pension would be raised to 7,500 lira was welcomed by retirees") is understandable.
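For what it's worth, the failure does not seem to need the pipeline at all: the traceback shows the crash happening inside `AutoTokenizer.from_pretrained`, before any model weights are touched, so this should be a minimal reproduction:

```python
from transformers import AutoTokenizer

# The traceback above fails while loading the tokenizer, so this
# two-liner should reproduce the error without the summarization pipeline.
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
```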
I believe the error is more about `sentencepiece` than about pipelines. I checked similar issues on GitHub, Stack Overflow, and some Chinese forums, but none of them helped.
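One idea I have not ruled out is a corrupted cached copy. As far as I know, `from_pretrained` accepts a `force_download=True` flag that re-fetches the files instead of reusing the local cache, so something like this might be worth trying (untested on my side):

```python
from transformers import AutoTokenizer

# Untested idea: bypass the possibly-corrupted cached copy by re-downloading.
tokenizer = AutoTokenizer.from_pretrained(
    "csebuetnlp/mT5_multilingual_XLSum",
    force_download=True,  # re-fetch files instead of using the local cache
)
```

I need some help with the code. Thanks.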