Questions tagged [sentencepiece]
21 questions
13
votes
1 answer
sentencepiece library is not being installed in the system
While running pip install tf-models-official, I ran into the following problem while the library was being installed:
Collecting tf-models-official
Using cached tf_models_official-2.8.0-py2.py3-none-any.whl (2.2 MB)
Requirement already satisfied:…

Daremitsu
- 545
- 2
- 8
- 24
8
votes
2 answers
How to add a new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is…

sid8491
- 6,622
- 6
- 38
- 64
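With HuggingFace transformers, the usual route is `tokenizer.add_special_tokens({"additional_special_tokens": [...]})` followed by `model.resize_token_embeddings(len(tokenizer))`. The sketch below uses a hypothetical toy vocab to illustrate the bookkeeping those two calls perform, and why the embedding matrix has to grow:

```python
# Toy vocab (hypothetical) standing in for a real tokenizer's vocabulary.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
embedding_rows = len(vocab)  # one embedding vector per token id

def add_special_tokens(vocab, tokens):
    """Append unseen tokens at the end of the vocab; return how many were added."""
    added = 0
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # new tokens get fresh ids past the old range
            added += 1
    return added

added = add_special_tokens(vocab, ["[QUERY]", "[ANSWER]"])
embedding_rows += added  # mirrors model.resize_token_embeddings(len(tokenizer))

print(vocab["[QUERY]"], vocab["[ANSWER]"], embedding_rows)  # 4 5 6
```

Forgetting the resize step is the classic cause of index errors once the new ids reach the model.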
5
votes
1 answer
Why does the HuggingFace T5 tokenizer ignore some of the whitespace?
I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So, it tokenizes…

Berkay Berabi
- 1,933
- 1
- 10
- 26
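A likely culprit (an assumption, not confirmed by the question): SentencePiece-style preprocessing typically collapses runs of whitespace and rewrites spaces to the metaspace symbol '▁' before added-token matching happens, so a second '\n' or '\t' is gone before the tokenizer ever sees it. A minimal sketch of that normalization step:

```python
import re

METASPACE = "\u2581"  # the '▁' symbol SentencePiece uses for spaces

def normalize(text: str) -> str:
    # Collapse any run of whitespace (space, \t, \n) into a single space.
    return re.sub(r"\s+", " ", text)

def to_metaspace(text: str) -> str:
    # Normalize first, then rewrite spaces to the metaspace symbol.
    return normalize(text).replace(" ", METASPACE)

print(to_metaspace("a\t\tb"))   # 'a▁b' — the second tab is gone
print(to_metaspace("a \n b"))   # 'a▁b'
```

If this is the cause, the fix is to disable or customize the tokenizer's normalizer rather than to add more whitespace tokens.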
2
votes
0 answers
SentencePiece tokenizer encodes to unknown token
I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no Unicode characters and then try to encode a string that…

Shital Shah
- 63,284
- 17
- 238
- 185
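The behavior described is the normal `<unk>` fallback: a piece absent from the trained vocabulary can only be emitted as the unknown id, which is exactly what happens when the training data never contained the characters being encoded. A toy illustration (not the real SentencePiece algorithm):

```python
# Hypothetical tiny vocabulary; id 0 is reserved for the unknown token.
vocab = {"<unk>": 0, "\u2581hel": 1, "lo": 2}
UNK_ID = vocab["<unk>"]

def encode_pieces(pieces):
    """Map each piece to its id, falling back to <unk> for out-of-vocabulary pieces."""
    return [vocab.get(p, UNK_ID) for p in pieces]

print(encode_pieces(["\u2581hel", "lo"]))   # [1, 2]
print(encode_pieces(["\u2581caf\u00e9"]))   # [0]  — unseen piece maps to <unk>
```

The remedy is usually to include the relevant characters in the training corpus or to enable byte/character fallback when training the tokenizer.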
2
votes
1 answer
Error while converting a .pth file to ggml format
This is the error I get when I run convert-pth-to-ggml.py. I don't know whether it is caused by my file management (leaving the model unable to load) or by the OS:
Traceback (most recent call last):
File…

Tanish Shah
- 39
- 5
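To rule out the file-management possibility before blaming the OS, a quick pre-flight check helps. This is a sketch; the expected names below match the original LLaMA checkpoint layout that convert-pth-to-ggml.py worked with, and may differ for other models:

```python
import glob
import os

def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found with the checkpoint directory."""
    problems = []
    if not os.path.isdir(model_dir):
        return [f"not a directory: {model_dir}"]
    if not glob.glob(os.path.join(model_dir, "consolidated.*.pth")):
        problems.append("no consolidated.*.pth checkpoint files")
    if not os.path.isfile(os.path.join(model_dir, "params.json")):
        problems.append("missing params.json")
    if not os.path.isfile(os.path.join(model_dir, "..", "tokenizer.model")):
        problems.append("missing tokenizer.model next to the model folder")
    return problems

print(check_model_dir("models/7B"))
```

An empty list means the layout looks right and the traceback points elsewhere.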
2
votes
1 answer
(OpenNMT) Spanish to English Model Improvement
I'm currently trying to train a Spanish-to-English model using YAML scripts. My dataset is pretty big, but just for starters I'm trying to get a 10,000 training set and a 1,000-2,000 validation set working well first. However, after trying for days, I…

Jose Chavez
- 115
- 9
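For reference, a minimal OpenNMT-py YAML config for this kind of setup looks roughly like the sketch below. All paths, vocab files, and step counts are placeholders, not values from the question:

```yaml
# Minimal OpenNMT-py training config sketch (placeholder paths and sizes).
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: data/train.es
    path_tgt: data/train.en
  valid:
    path_src: data/valid.es
    path_tgt: data/valid.en
save_model: run/model
train_steps: 10000
valid_steps: 1000
```

With only ~10k training pairs, poor results are expected regardless of the config; small-data runs mainly verify the pipeline works end to end.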
1
vote
1 answer
libsentencepiece.so.0: cannot open shared object file: No such file or directory when creating BERTopic model
I am trying to train a BERTopic Model in python. However, I get this error:
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
libsentencepiece.so.0: cannot open shared object file:…

kmcclenn
- 127
- 11
1
vote
0 answers
Getting "Unable to load vocabulary from file." while using pipelines
I have been trying to use the "csebuetnlp/mT5_multilingual_XLSum" model for summarization purposes.
The code I tried is listed as below:
!pip install transformers
!pip install sentencepiece
import transformers
text_example = """
En düşük emekli…

dicloflom
- 11
- 1
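This error frequently means the sentencepiece package was not importable at the moment transformers resolved the tokenizer; in notebooks, the runtime usually has to be restarted after `!pip install sentencepiece`. A small guard (a sketch) to run before building the pipeline:

```python
import importlib.util

def has_module(name: str) -> bool:
    """True if `name` can be imported in the current runtime."""
    return importlib.util.find_spec(name) is not None

# Check before calling transformers.pipeline("summarization", ...).
if not has_module("sentencepiece"):
    print("sentencepiece missing - install it and restart the runtime")
```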
1
vote
0 answers
How to integrate sentencepiece and protobuf into an existing Android project correctly
I am trying to integrate a PyTorch model to process language, which is why I need sentencepiece to tokenize the sentence chunks. But I am unable to do that correctly.
I did not find any robust documentation on integrating sentencepiece into an Android…

im07
- 386
- 2
- 12
1
vote
1 answer
Saving SentencepieceTokenizer in Keras model throws TypeError: Failed to convert elements of [None, None] to Tensor
I'm trying to save a Keras model which uses a SentencepieceTokenizer.
Everything is working so far but I am unable to save the Keras model.
After training the sentencepiece model, I am creating the Keras model, call it with some examples first and…

Stefan Falk
- 23,898
- 50
- 191
- 378
1
vote
0 answers
Slow and fast tokenizers give different outputs (sentencepiece tokenization)
When I use T5TokenizerFast (the fast tokenizer of the T5 architecture), the output is as expected:
['▁', '', '▁Hello', '▁', '', '']
But when I use the normal (slow) tokenizer, it starts to split the special token "</s>" as follows:
['▁', 's', '>',…

canP
- 25
- 4
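The discrepancy comes down to whether "</s>" is matched as one atomic special token or falls through to ordinary piece-splitting. A toy splitter (not the real T5 implementation) showing both behaviors:

```python
def tokenize(text, special_tokens=()):
    """Greedy left-to-right split: match registered special tokens atomically,
    otherwise fall back to single-character pieces."""
    out, i = [], 0
    while i < len(text):
        for sp in special_tokens:
            if text.startswith(sp, i):  # atomic match of a special token
                out.append(sp)
                i += len(sp)
                break
        else:
            out.append(text[i])         # fallback: emit one character
            i += 1
    return out

print(tokenize("hi</s>", special_tokens=("</s>",)))  # ['h', 'i', '</s>']
print(tokenize("hi</s>"))                            # ['h', 'i', '<', '/', 's', '>']
```

If the slow tokenizer splits the token, it usually was not registered in that tokenizer's special-token list, even though the fast one knows it.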
1
vote
1 answer
SentencePiece in Google Colab
I want to use sentencepiece, from https://github.com/google/sentencepiece, in a Google Colab project where I am training an OpenNMT model. I'm a little confused about how to set up the sentencepiece binaries in Google Colab. Do I need to build with…

Jose Chavez
- 115
- 9
1
vote
1 answer
How to add a new token to a T5 tokenizer which uses sentencepiece
I am training the T5 transformer (based on TensorFlow) from the following link:
https://github.com/google-research/text-to-text-transfer-transformer
Here is a sample (input, output):
input:
b'[atomic]:PersonX plays a ___ in the…

Ahmad
- 8,811
- 11
- 76
- 141
1
vote
0 answers
"OSError: Model name './XX' was not found in tokenizers model name list" - cannot load custom tokenizer in Transformers
I'm trying to create my own tokenizer with my own dataset/vocabulary using SentencePiece and then use it with AlbertTokenizer from transformers.
I followed the HuggingFace tutorial on how to train a model from scratch really closely:…

tlqn
- 349
- 1
- 6
- 18
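AlbertTokenizer loads a SentencePiece model file, conventionally named "spiece.model", and this OSError is usually the loader failing to find that file in the given directory. A quick check (a sketch; the alternative filenames are conventions used by other model families) before calling from_pretrained:

```python
import os

def find_spm_model(tokenizer_dir: str):
    """Return the path of the SentencePiece model file in the directory, or None."""
    for name in ("spiece.model", "tokenizer.model", "sentencepiece.bpe.model"):
        candidate = os.path.join(tokenizer_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If this returns None for your tokenizer directory, rename or copy your trained .model file to "spiece.model" so AlbertTokenizer can pick it up.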
1
vote
1 answer
How can I update sentencepiece package to its latest version using conda?
I have installed conda on Linux (Ubuntu 16). When I install or update the package named sentencepiece, it installs version 0.1.85 (which I guess is from 2 months ago, according to the Anaconda website). However, the latest version is 0.1.91.
I can't install…

Ahmad
- 8,811
- 11
- 76
- 141