Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
41
votes
5 answers
How to disable TOKENIZERS_PARALLELISM=(true | false) warning?
I use PyTorch to train a huggingface-transformers model, but at every epoch it prints this warning:
The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set…

snowzjy
- 521
- 1
- 4
- 5
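The warning names its own fix: set the environment variable explicitly. A minimal sketch (the training loop is omitted), assuming the variable is set before any tokenizer spins up its thread pool:

import os

# Disable tokenizer parallelism before transformers/tokenizers are imported,
# so forked DataLoader workers no longer trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer  # imported after the env var is set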
38
votes
5 answers
ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts,…

Raoof Naushad
- 526
- 1
- 5
- 7
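Fast tokenizers raise this ValueError when an input is not a plain string, typically a NaN read from the CSV. A hedged sketch of the usual cleanup, reusing the question's 'text' and 'sentiment' columns; the file path is a placeholder:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("reviews.csv")        # placeholder path
df = df.dropna(subset=["text"])        # NaN rows break TextEncodeInput
df["text"] = df["text"].astype(str)    # force every entry to str
train, test = train_test_split(df, test_size=0.1, random_state=100)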
31
votes
4 answers
Transformers v4.x: Convert slow tokenizer to fast tokenizer
I'm following the transformers pretrained model xlm-roberta-large-xnli example:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
and I get the…

Miguel Trejo
- 5,913
- 5
- 24
- 49
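The usual culprit here is a missing sentencepiece backend: converting a sentencepiece-based slow tokenizer (such as XLM-RoBERTa's) to a fast one needs extra packages. A sketch of the commonly reported fix, not a guaranteed one:

# pip install sentencepiece protobuf   <- required for the slow->fast conversion
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
# Fallback: build the slow tokenizer yourself and hand it to the pipeline
# via its tokenizer= argument.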
25
votes
3 answers
Huggingface saving tokenizer
I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer =…

sachinruk
- 9,571
- 12
- 55
- 86
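A save_pretrained/from_pretrained round trip covers this; a minimal sketch with a placeholder directory:

from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Writes vocab and tokenizer config files to a local directory...
tokenizer.save_pretrained("./tokenizer")      # placeholder path

# ...which can then be loaded inside the offline container:
tokenizer = AutoTokenizer.from_pretrained("./tokenizer")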
22
votes
2 answers
Suppress HuggingFace logging warning: "Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation."
In HuggingFace, every time I call a pipeline() object, I get a warning:
`"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation."
How do I suppress this warning without suppressing all logging warnings? I want other…

Rylan Schaeffer
- 1,945
- 2
- 28
- 50
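One targeted way to silence exactly this message is to set pad_token_id yourself, so generate() no longer announces its default. A sketch assuming a GPT-2 text-generation pipeline (the question does not name the model):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # model is an assumption

# Generation kwargs pass straight through the pipeline call; an explicit
# pad_token_id removes the warning without muting the logging module.
out = generator("Hello there", pad_token_id=generator.tokenizer.eos_token_id)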
22
votes
1 answer
How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?
I am working on a text-classification problem where I want to use the BERT model as the base, followed by Dense layers. I want to know how the three arguments work. For example, if I have 3 sentences as:
'My name is slim shade and I am an aspiring…

Deshwal
- 3,436
- 4
- 35
- 94
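In short: truncation=True cuts sequences down to max_length, while padding="max_length" pads every sequence up to it, so the batch comes out rectangular. A small sketch with made-up sentences:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
sentences = ["My name is slim shade and I am an aspiring rapper",
             "A short one",
             "Another sentence"]

enc = tokenizer(sentences, max_length=10, truncation=True,
                padding="max_length", return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([3, 10]): truncated or padded to 10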
20
votes
6 answers
Huggingface ALBERT tokenizer NoneType error with Colab
I simply tried the sample code from the Hugging Face website: https://huggingface.co/albert-base-v2
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
text = "Replace me by any text you'd…

MeiNan Zhu
- 1,021
- 1
- 9
- 18
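The commonly reported cause is a missing sentencepiece dependency: without it the ALBERT tokenizer cannot be built and comes back as None. A sketch of the usual Colab remedy (install, then restart the runtime):

# !pip install sentencepiece   <- then restart the Colab runtime
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
assert tokenizer is not None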
16
votes
3 answers
How to truncate input in the Huggingface pipeline?
I currently use a huggingface pipeline for sentiment-analysis like so:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', device=0)
The problem is that when I pass texts longer than 512 tokens it just crashes, saying that…

EtienneT
- 5,045
- 6
- 36
- 39
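In recent transformers versions, tokenizer kwargs can be passed straight through the pipeline call, so truncation happens before the model sees the text. A sketch with a synthetic long input:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", device=0)
long_text = "This film was astonishing. " * 300   # well past 512 tokens

# truncation/max_length are forwarded to the underlying tokenizer call:
result = classifier(long_text, truncation=True, max_length=512)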
15
votes
2 answers
BertModel transformers outputs string instead of tensor
I'm following this tutorial, which codes a sentiment-analysis classifier using BERT with the huggingface library, and I'm seeing very odd behavior. When I try the BERT model with a sample text, I get a string instead of the hidden state. This is the…

Miguel
- 2,738
- 3
- 35
- 51
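This is the transformers v4 behavior change: models return a ModelOutput object, and tuple-unpacking it iterates over its keys, which are strings. A sketch of both fixes, named access or return_dict=False:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
enc = tokenizer("a sample text", return_tensors="pt")

out = model(**enc)
hidden = out.last_hidden_state               # a tensor, not a string
# or restore the old tuple behavior:
hidden, pooled = model(**enc, return_dict=False)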
15
votes
2 answers
How to encode multiple sentences using transformers.BertTokenizer?
I would like to create a minibatch by encoding multiple sentences using transformers.BertTokenizer. It seems to work for a single sentence. How do I make it work for several sentences?
from transformers import BertTokenizer
tokenizer =…

Lei Hao
- 708
- 1
- 7
- 21
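Passing a list to the tokenizer's __call__ (with padding enabled) builds the minibatch in one step; a minimal sketch:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = ["First sentence.", "A somewhat longer second sentence."]

enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(enc["input_ids"].shape)    # (2, length of the longest sequence)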
13
votes
2 answers
Download pre-trained sentence-transformers model locally
I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) for creating embeddings of sentences using the pre-trained model bert-base-nli-mean-tokens. I have an application that will be…

neha tamore
- 181
- 1
- 1
- 8
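One download-once-then-load-locally sketch; the target directory is a placeholder:

from sentence_transformers import SentenceTransformer

# While online: download the model and persist all its files locally.
model = SentenceTransformer("bert-base-nli-mean-tokens")
model.save("./bert-base-nli-mean-tokens-local")     # placeholder path

# Later, fully offline: load from the saved directory.
model = SentenceTransformer("./bert-base-nli-mean-tokens-local")
embeddings = model.encode(["a sentence to embed"])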
10
votes
3 answers
HuggingFace AutoModelForCausalLM "decoder-only architecture" warning, even after setting padding_side='left'
I'm using AutoModelForCausalLM and AutoTokenizer to generate text output with DialoGPT.
For whatever reason, even when using the provided examples from huggingface, I get this warning:
A decoder-only architecture is being used, but right-padding was…

TurboToaster33
- 101
- 1
- 4
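padding_side='left' alone is often not enough: the tokenizer also needs a pad token, and generate() needs the attention mask. A sketch assuming DialoGPT, as in the question:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium",
                                          padding_side="left")
tokenizer.pad_token = tokenizer.eos_token     # GPT-2 family has no pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

enc = tokenizer("Hello, how are you?", return_tensors="pt", padding=True)
out = model.generate(enc["input_ids"],
                     attention_mask=enc["attention_mask"],  # this silences it
                     pad_token_id=tokenizer.eos_token_id,
                     max_new_tokens=40)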
10
votes
3 answers
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier
I'm trying to get the sentiment for comments with the help of a hugging face sentiment-analysis pretrained model. It returns an error like: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512)…

Nithin Reddy
- 580
- 2
- 8
- 18
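Truncating at tokenization time keeps every comment within the model's 512-token window; here it is forwarded through the pipeline call (the model name is an assumption, the question does not give one):

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
comment = "word " * 700                      # longer than the 512-token limit

result = classifier(comment, truncation=True, max_length=512)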
8
votes
4 answers
Facing SSL Error with Huggingface pretrained models
I am facing the issue below while loading a pretrained model from HuggingFace.
HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /roberta-base/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1,…

chaitu
- 1,036
- 5
- 20
- 39
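This usually points at a proxy or self-signed certificate in the chain. One hedged workaround is to hand the HTTP stack your organisation's CA bundle; the path below is a placeholder, not a real file:

import os

# Point requests (which huggingface_hub uses for downloads) at the
# corporate CA bundle before any download is attempted.
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/corporate-ca-bundle.pem"

from transformers import AutoModel
model = AutoModel.from_pretrained("roberta-base")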
8
votes
1 answer
What is so special about special tokens?
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
when do you add a token (when you want to expand your vocab)
What I…

ShaoMin Liu
- 93
- 1
- 6
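The practical difference: a special token is shielded from the tokenizer's normalization and splitting, and can be dropped wholesale when decoding. A small sketch contrasting add_tokens with add_special_tokens (the token names are made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["newword"])                    # plain vocabulary entry
tokenizer.add_special_tokens({"additional_special_tokens": ["[CTRL]"]})

ids = tokenizer.encode("[CTRL] newword")
print(tokenizer.decode(ids))                            # keeps [CTRL]
print(tokenizer.decode(ids, skip_special_tokens=True))  # drops it: 'newword'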