I have been trying to use the `csebuetnlp/mT5_multilingual_XLSum` model for summarization. The code I tried is listed below:
```python
!pip install transformers
!pip install sentencepiece
import transformers
text_example = """
En düşük emekli aylığının 5 bin 500 liradan 7 bin 500 liraya yükseltilmesi için TBMM'de yasal düzenleme yapılacak; ardından zamlı aylıkların nisan ayında hesaplara aktarılması planlanıyor.
AKP'li Cumhurbaşkanı Recep Tayyip Erdoğan'ın dün katıldığı televizyon programında en düşük emekli aylığının 7 bin 500 liraya yükseltildiği yönündeki açıklaması, emekliler tarafından memnuniyetle karşılandı.
Bu müjdenin ardından gözler, söz konusu kararın uygulanması için TBMM'de yapılacak yasal düzenlemeye çevrildi.
En düşük emekli aylığının 7 bin 500 liraya yükseltilmesi yönündeki kararın ilerleyen günlerde yasalaşması ve zamlı aylıkların nisan ayında hesaplara yatırılması bekleniyor.
Söz konusu artıştan, EYT düzenlemesiyle emekli olanlarla beraber emeklilerin yarısından fazlası yararlanacak.
"""
from transformers import pipeline
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
summarizer(text_example)
```
The output I got is listed below:
```
Requirement already satisfied: transformers in d:\anaaac\lib\site-packages (4.24.0)
Requirement already satisfied: regex!=2019.12.17 in d:\anaaac\lib\site-packages (from transformers) (2023.3.23)
Requirement already satisfied: pyyaml>=5.1 in d:\anaaac\lib\site-packages (from transformers) (6.0)
Requirement already satisfied: requests in d:\anaaac\lib\site-packages (from transformers) (2.28.1)
Requirement already satisfied: tqdm>=4.27 in d:\anaaac\lib\site-packages (from transformers) (4.65.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.10.0 in d:\anaaac\lib\site-packages (from transformers) (0.11.0)
Requirement already satisfied: packaging>=20.0 in d:\anaaac\lib\site-packages (from transformers) (23.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in d:\anaaac\lib\site-packages (from transformers) (0.13.2)
Requirement already satisfied: filelock in d:\anaaac\lib\site-packages (from transformers) (3.9.0)
Requirement already satisfied: numpy>=1.17 in d:\anaaac\lib\site-packages (from transformers) (1.24.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in d:\anaaac\lib\site-packages (from huggingface-hub<1.0,>=0.10.0->transformers) (4.4.0)
Requirement already satisfied: colorama in d:\anaaac\lib\site-packages (from tqdm>=4.27->transformers) (0.4.6)
Requirement already satisfied: certifi>=2017.4.17 in d:\anaaac\lib\site-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: charset-normalizer<3,>=2 in d:\anaaac\lib\site-packages (from requests->transformers) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in d:\anaaac\lib\site-packages (from requests->transformers) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in d:\anaaac\lib\site-packages (from requests->transformers) (1.26.14)
Requirement already satisfied: sentencepiece in d:\anaaac\lib\site-packages (0.1.97)
```

```
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1932, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1931 try:
-> 1932 tokenizer = cls(*init_inputs, **init_kwargs)
1933 except OSError:
File D:\anaaac\lib\site-packages\transformers\models\t5\tokenization_t5.py:155, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
154 self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 155 self.sp_model.Load(vocab_file)
File D:\anaaac\lib\site-packages\sentencepiece\__init__.py:905, in SentencePieceProcessor.Load(self, model_file, model_proto)
904 return self.LoadFromSerializedProto(model_proto)
--> 905 return self.LoadFromFile(model_file)
File D:\anaaac\lib\site-packages\sentencepiece\__init__.py:310, in SentencePieceProcessor.LoadFromFile(self, arg)
309 def LoadFromFile(self, arg):
--> 310 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "C:\Users\ist/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\spiece.model": No such file or directory Error #2
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
Cell In[4], line 17
4 text_example = """
5 En düşük emekli aylığının 5 bin 500 liradan 7 bin 500 liraya yükseltilmesi için TBMM'de yasal düzenleme yapılacak; ardından zamlı aylıkların nisan ayında hesaplara aktarılması planlanıyor.
6
(...)
13 Söz konusu artıştan, EYT düzenlemesiyle emekli olanlarla beraber emeklilerin yarısından fazlası yararlanacak.
14 """
16 from transformers import pipeline
---> 17 summarizer = pipeline("summarization", model= "csebuetnlp/mT5_multilingual_XLSum")
18 summarizer(text_example)
File D:\anaaac\lib\site-packages\transformers\pipelines\__init__.py:801, in pipeline(task, model, config, tokenizer, feature_extractor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
798 tokenizer_identifier = tokenizer
799 tokenizer_kwargs = model_kwargs
--> 801 tokenizer = AutoTokenizer.from_pretrained(
802 tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
803 )
805 if load_feature_extractor:
806 # Try to infer feature extractor from model or config name (if provided as str)
807 if feature_extractor is None:
File D:\anaaac\lib\site-packages\transformers\models\auto\tokenization_auto.py:619, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
615 if tokenizer_class is None:
616 raise ValueError(
617 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
618 )
--> 619 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
621 # Otherwise we have to be creative.
622 # if model is an encoder decoder, the encoder tokenizer class is used by default
623 if isinstance(config, EncoderDecoderConfig):
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1777, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1774 else:
1775 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1777 return cls._from_pretrained(
1778 resolved_vocab_files,
1779 pretrained_model_name_or_path,
1780 init_configuration,
1781 *init_inputs,
1782 use_auth_token=use_auth_token,
1783 cache_dir=cache_dir,
1784 local_files_only=local_files_only,
1785 _commit_hash=commit_hash,
1786 **kwargs,
1787 )
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1807, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1805 has_tokenizer_file = resolved_vocab_files.get("tokenizer_file", None) is not None
1806 if (from_slow or not has_tokenizer_file) and cls.slow_tokenizer_class is not None:
-> 1807 slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
1808 copy.deepcopy(resolved_vocab_files),
1809 pretrained_model_name_or_path,
1810 copy.deepcopy(init_configuration),
1811 *init_inputs,
1812 use_auth_token=use_auth_token,
1813 cache_dir=cache_dir,
1814 local_files_only=local_files_only,
1815 _commit_hash=_commit_hash,
1816 **(copy.deepcopy(kwargs)),
1817 )
1818 else:
1819 slow_tokenizer = None
File D:\anaaac\lib\site-packages\transformers\tokenization_utils_base.py:1934, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, *init_inputs, **kwargs)
1932 tokenizer = cls(*init_inputs, **init_kwargs)
1933 except OSError:
-> 1934 raise OSError(
1935 "Unable to load vocabulary from file. "
1936 "Please check that the provided vocabulary is accessible and not corrupted."
1937 )
1939 # Save inputs and kwargs for saving and re-loading with ``save_pretrained``
1940 # Removed: Now done at the base class level
1941 # tokenizer.init_inputs = init_inputs
1942 # tokenizer.init_kwargs = init_kwargs
1943
1944 # If there is a complementary special token map, load it
1945 special_tokens_map_file = resolved_vocab_files.pop("special_tokens_map_file", None)
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
```
(For context, the Turkish input text is about a raise in the minimum retirement pension in Turkey.)
This part of the output is really strange, because the `spiece.model` file exists in exactly that directory:

```
OSError: Not found: "C:\Users\ist/.cache\huggingface\hub\models--csebuetnlp--mT5_multilingual_XLSum\snapshots\2437a524effdbadc327ced84595508f1e32025b3\spiece.model": No such file or directory Error #2
```
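To double-check that, here is a minimal sketch of what I mean (the path is copied verbatim from the traceback, including the mixed slash; the snapshot hash is specific to my cached download). As far as I understand, the Hugging Face cache usually stores snapshot files as symlinks into a `blobs` folder on systems where symlinks are available, and `os.path.exists()` follows symlinks, so this check would also hint at whether a link is broken:

```python
import os

# Path copied verbatim from the traceback above; the snapshot hash
# (2437a5...) is specific to my machine's cached download.
p = (r"C:\Users\ist/.cache\huggingface\hub"
     r"\models--csebuetnlp--mT5_multilingual_XLSum"
     r"\snapshots\2437a524effdbadc327ced84595508f1e32025b3"
     r"\spiece.model")

print(os.path.exists(p))   # False would mean Python cannot resolve the file (e.g. a broken link)
print(os.path.islink(p))   # True would mean the snapshot entry is a symlink into the blobs dir
```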
Version info:
Server Information:
You are using Jupyter Notebook.
The version of the notebook server is: **6.5.3**
The server is running on this version of Python:
`Python 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]`
- I tried upgrading `transformers` and `sentencepiece`; it didn't help.
- I tried the models `sshleifer/distilbart-cnn-12-6` and `flax-community/t5-base-cnn-dm`. They both worked as expected, but I need a multilingual model.
- I tried running the same code in Google Colaboratory; it worked as expected, and the output is:
```
Downloading (…)lve/main/config.json: 100%
730/730 [00:00<00:00, 14.0kB/s]
Downloading pytorch_model.bin: 100%
2.33G/2.33G [00:23<00:00, 109MB/s]
Downloading (…)okenizer_config.json: 100%
375/375 [00:00<00:00, 9.51kB/s]
Downloading spiece.model: 100%
4.31M/4.31M [00:00<00:00, 14.0MB/s]
Downloading (…)cial_tokens_map.json: 100%
65.0/65.0 [00:00<00:00, 2.70kB/s]
/usr/local/lib/python3.9/dist-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
[{'summary_text': "Cumhurbaşkanı Recep Tayyip Erdoğan'ın dün açıkladığı en düşük emekli aylığının 7 bin 500 liraya yükseltilmesi yönündeki açıklaması, emekliler tarafından memnuniyetle karşılandı."}]
```
This output is not weird, and the summary (roughly: "President Recep Tayyip Erdoğan's announcement yesterday that the lowest retirement pension would be raised to 7,500 lira was welcomed by retirees") is understandable.
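For what it's worth, the failure does not seem to need the pipeline at all: the traceback shows the crash happening inside `AutoTokenizer.from_pretrained`, before any model weights are touched, so this should be a minimal reproduction:

```python
from transformers import AutoTokenizer

# The traceback above fails while loading the tokenizer, so this
# two-liner should reproduce the error without the summarization pipeline.
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
```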
I believe the error is more about `sentencepiece` than about pipelines. I checked similar issues on GitHub, Stack Overflow, and some Chinese forums, but none of them helped.
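One idea I have not ruled out is a corrupted cached copy. As far as I know, `from_pretrained` accepts a `force_download=True` flag that re-fetches the files instead of reusing the local cache, so something like this might be worth trying (untested on my side):

```python
from transformers import AutoTokenizer

# Untested idea: bypass the possibly-corrupted cached copy by re-downloading.
tokenizer = AutoTokenizer.from_pretrained(
    "csebuetnlp/mT5_multilingual_XLSum",
    force_download=True,  # re-fetch files instead of using the local cache
)
```

I need some help with the code. Thanks.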