Questions tagged [sentencepiece]

21 questions
13 votes • 1 answer

sentencepiece library is not being installed in the system

While running pip install tf-models-official I ran into the following problem during installation: Collecting tf-models-official Using cached tf_models_official-2.8.0-py2.py3-none-any.whl (2.2 MB) Requirement already satisfied:…
8 votes • 2 answers

How to add new special token to the tokenizer?

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased). QUERY: I want to ask a question. ANSWER: Sure, ask away. QUERY: How is the weather today? ANSWER: It is…
sid8491 • 6,622
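The usual answer to this question uses the real transformers calls `tokenizer.add_special_tokens({'additional_special_tokens': [...]})` followed by `model.resize_token_embeddings(len(tokenizer))`. As a minimal pure-Python sketch of the mechanism (the dict below is a stand-in for the tokenizer's vocabulary, not the library API):

```python
# Toy model of what adding special tokens does: append unseen tokens to the
# vocabulary, assigning each the next free id, so the embedding matrix can
# then be resized to the new vocabulary size.

def add_special_tokens(vocab, new_tokens):
    """Add any tokens not already in vocab; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free id
            added += 1
    return added

vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2}
n_added = add_special_tokens(vocab, ["[QUERY]", "[ANSWER]"])
# n_added == 2; the embedding matrix would now need len(vocab) == 5 rows.
```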
5 votes • 1 answer

why does huggingface t5 tokenizer ignore some of the whitespaces?

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So, it tokenizes…
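A plausible cause (an assumption, but it matches the behaviour described) is SentencePiece-style whitespace normalization: runs of whitespace are collapsed to a single separator before vocabulary lookup, so the second "\n" or "\t" never reaches the tokenizer. A toy sketch of that normalization step:

```python
import re

# Toy model of whitespace normalization as applied before tokenization:
# any run of whitespace characters collapses to one space, which is why a
# second consecutive "\n" or "\t" disappears from the output.

def normalize(text):
    return re.sub(r"\s+", " ", text)

print(normalize("hello\n\nworld"))  # the second "\n" is gone
```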
2 votes • 0 answers

SentencePiece tokenizer encodes to unknown token

I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode characters and then try to encode a string that…
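The mechanism behind the unknown-token output is simple: any piece that was never seen during vocabulary training has no entry, so the lookup falls back to the unknown id. A toy sketch of that fallback (the vocab and ids below are made up for illustration, not taken from the trained tokenizer):

```python
# Toy model of encoding with an <unk> fallback: pieces missing from the
# trained vocabulary map to the unknown id instead of their own id.

UNK_ID = 0

def encode(pieces, vocab):
    return [vocab.get(piece, UNK_ID) for piece in pieces]

vocab = {"<unk>": 0, "▁hello": 1, "▁world": 2}
ids = encode(["▁hello", "▁héllo"], vocab)  # "▁héllo" was never trained on
# ids == [1, 0] — the unseen piece becomes <unk>
```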
2 votes • 1 answer

Error while converting a pth file to ggml format with convert-pth-to-ggml.py

This is the error I get when I run convert-pth-to-ggml.py. I don't know whether it is due to my file management (so the model is unable to load) or due to the OS: Traceback (most recent call last): File…
2 votes • 1 answer

(OpenNMT) Spanish to English Model Improvement

I’m currently trying to train a Spanish-to-English model using YAML scripts. My data set is pretty big, but just for starters I’m trying to get a 10,000 training set and a 1,000-2,000 validation set working well first. However, after trying for days, I…
1 vote • 1 answer

libsentencepiece.so.0: cannot open shared object file: No such file or directory when creating BERTopic model

I am trying to train a BERTopic Model in python. However, I get this error: RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback): libsentencepiece.so.0: cannot open shared object file:…
kmcclenn • 127
1 vote • 0 answers

Got the "Unable to load vocabulary from file." while using pipelines

I have been trying to use the "csebuetnlp/mT5_multilingual_XLSum" model for summarization purposes. The code I tried is listed below: !pip install transformers !pip install sentencepiece import transformers text_example = """ En düşük emekli…
1 vote • 0 answers

How to integrate sentencepiece and protobuf into an existing Android project correctly

I am trying to integrate a PyTorch model to process language, which is why I need sentencepiece to tokenize the sentence chunks. But I am unable to do that correctly. I did not find any robust documentation on integrating sentencepiece into an Android…
im07 • 386
1 vote • 1 answer

Saving SentencepieceTokenizer in Keras model throws TypeError: Failed to convert elements of [None, None] to Tensor

I'm trying to save a Keras model which uses a SentencepieceTokenizer. Everything is working so far but I am unable to save the Keras model. After training the sentencepiece model, I am creating the Keras model, call it with some examples first and…
Stefan Falk • 23,898
1 vote • 0 answers

Slow and fast tokenizers give different outputs (sentencepiece tokenization)

When I use T5TokenizerFast (the fast tokenizer of the T5 architecture), the output is as expected: ['▁', '', '▁Hello', '▁', '', ''] But when I use the normal (slow) tokenizer, it starts to split the special token "</s>" as follows: ['▁',…
canP • 25
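The general shape of this mismatch can be shown without the actual T5 tokenizers (the two functions below are toys, not the transformers implementations): one tokenizer treats the special token as atomic, while the other falls through to splitting it into ordinary pieces.

```python
# Toy demonstration of how two tokenizers can disagree on a special token:
# one matches "</s>" atomically, the other splits it character by character.

def tokenize_with_specials(text, specials=("</s>",)):
    if text in specials:
        return [text]  # the special token survives as one piece
    return list(text)

def tokenize_chars(text):
    return list(text)  # no special-token handling at all

fast_like = tokenize_with_specials("</s>")  # ['</s>']
slow_like = tokenize_chars("</s>")          # ['<', '/', 's', '>']
```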
1 vote • 1 answer

SentencePiece in Google Colab

I want to use sentencepiece, from https://github.com/google/sentencepiece, in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the sentencepiece binaries in Google Colab. Do I need to build with…
1 vote • 1 answer

How to add new token to T5 tokenizer which uses sentencepiece

I am training the T5 transformer, which is TensorFlow-based, from the following link: https://github.com/google-research/text-to-text-transfer-transformer Here is a sample (input, output): input: b'[atomic]:PersonX plays a ___ in the…
Ahmad • 8,811
1 vote • 0 answers

"OSError: Model name './XX' was not found in tokenizers model name list" - cannot load custom tokenizer in Transformers

I'm trying to create my own tokenizer with my own dataset/vocabulary using SentencePiece and then use it with the transformers AlbertTokenizer. I closely followed the HuggingFace tutorial on how to train a model from scratch:…
1 vote • 1 answer

How can I update sentencepiece package to its latest version using conda?

I have installed conda on Ubuntu 16. When I install or update the sentencepiece package, it installs version 0.1.85 (which I guess is from 2 months ago, according to the Anaconda website). However, the latest version is 0.1.91. I can't install…
Ahmad • 8,811
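One common route here (a sketch, not a guaranteed fix — the version numbers come from the question, and current channel availability may differ) is to pull the package from the conda-forge channel instead of defaults, or to fall back to pip inside the same environment:

```shell
# Try conda-forge, which usually carries newer builds than defaults:
conda install -c conda-forge sentencepiece=0.1.91

# If no suitable conda build exists, upgrade via pip inside the active env:
pip install --upgrade sentencepiece
```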