Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
41 votes, 5 answers

How to disable TOKENIZERS_PARALLELISM=(true | false) warning?

I use PyTorch to train a huggingface-transformers model, but every epoch it outputs the warning: The current process just got forked. Disabling parallelism to avoid deadlocks... To disable this warning, please explicitly set…
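
The widely cited fix is to set the TOKENIZERS_PARALLELISM environment variable before the first tokenizer call; a minimal sketch (the model name is just an example):

    import os
    # Must run before any tokenizer is created or any DataLoader worker forks
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
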
38 votes, 5 answers

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts,…
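
Answers generally trace this error to non-string entries (typically NaN from pd.read_csv) reaching the fast tokenizer; a sketch of the usual guard, assuming a DataFrame df with a 'text' column and a tokenizer already loaded:

    # Fast tokenizers accept only str (or str pairs); NaN floats raise
    # the TextEncodeInput ValueError, so drop or cast them first.
    df = df.dropna(subset=["text"])
    texts = df["text"].astype(str).to_list()
    encodings = tokenizer(texts, truncation=True, padding=True)
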
31 votes, 4 answers

Transformers v4.x: Convert slow tokenizer to fast tokenizer

I'm following the transformers pretrained-model example for xlm-roberta-large-xnli: from transformers import pipeline classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli") and I get the…
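
The error behind this question usually comes from the missing sentencepiece dependency that transformers needs in order to convert this SentencePiece-based slow tokenizer to a fast one; a sketch of two common workarounds:

    # Option 1: install the conversion dependency and reload:
    #   pip install sentencepiece
    # Option 2: skip the slow-to-fast conversion entirely:
    from transformers import pipeline
    classifier = pipeline("zero-shot-classification",
                          model="joeddav/xlm-roberta-large-xnli",
                          use_fast=False)  # stay on the slow tokenizer
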
25 votes, 3 answers

Huggingface saving tokenizer

I am trying to save a Hugging Face tokenizer so that I can load it later from a container without internet access. BASE_MODEL = "distilbert-base-multilingual-cased" tokenizer =…
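
save_pretrained/from_pretrained is the standard round trip here; a minimal sketch (the local directory name is a placeholder):

    from transformers import AutoTokenizer

    BASE_MODEL = "distilbert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.save_pretrained("./local_tokenizer")  # writes vocab and config files

    # Later, inside the offline container:
    tokenizer = AutoTokenizer.from_pretrained("./local_tokenizer")
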
22 votes, 2 answers

Suppress HuggingFace logging warning: "Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation."

In HuggingFace, every time I call a pipeline() object, I get a warning: "Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation." How do I suppress this warning without suppressing all logging warnings? I want other…
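
The usual answer is to set pad_token_id explicitly so generate() has nothing to warn about; a sketch using gpt2 as a stand-in model:

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    # Passing pad_token_id through the call forwards it to generate()
    out = generator("Hello, world",
                    pad_token_id=generator.tokenizer.eos_token_id)
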
22 votes, 1 answer

How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?

I am working on a text classification problem where I want to use the BERT model as the base, followed by dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences as: 'My name is slim shade and I am an aspiring…
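
A quick experiment makes the three arguments concrete; a sketch with a toy max_length:

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    sentences = ["short one",
                 "a somewhat longer sentence",
                 "the longest sentence of the three examples here"]

    # truncation=True cuts sequences down to max_length;
    # padding="max_length" pads every sequence up to max_length;
    # padding=True would instead pad only to the longest in the batch.
    enc = tokenizer(sentences, max_length=8, truncation=True, padding="max_length")
    print([len(ids) for ids in enc["input_ids"]])  # [8, 8, 8]
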
20 votes, 6 answers

Huggingface AlBert tokenizer NoneType error with Colab

I simply tried the sample code from the Hugging Face website (https://huggingface.co/albert-base-v2): from transformers import AlbertTokenizer, AlbertModel tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2') text = "Replace me by any text you'd…
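
The common resolution is installing sentencepiece (AlbertTokenizer is SentencePiece-based) and restarting the runtime; a sketch:

    # pip install sentencepiece  -- then restart the Colab runtime
    from transformers import AlbertTokenizer, AlbertModel

    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    assert tokenizer is not None  # comes back None on some versions without sentencepiece
    model = AlbertModel.from_pretrained("albert-base-v2")
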
16 votes, 3 answers

How to truncate input in the Huggingface pipeline?

I currently use a huggingface pipeline for sentiment-analysis like so: from transformers import pipeline classifier = pipeline('sentiment-analysis', device=0) The problem is that when I pass texts larger than 512 tokens, it just crashes saying that…
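
On recent transformers versions, tokenizer arguments can be passed straight through the pipeline call; a sketch (long_text is a placeholder variable):

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", device=0)
    # truncation/max_length are forwarded to the tokenizer
    result = classifier(long_text, truncation=True, max_length=512)
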
15 votes, 2 answers

BertModel transformers outputs string instead of tensor

I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library, and I'm seeing very odd behavior. When trying the BERT model with a sample text, I get a string instead of the hidden state. This is the…
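
The likely culprit is transformers v4, where models return a ModelOutput object; tuple-unpacking it yields its keys, i.e. strings. A sketch of both fixes, assuming input_ids/attention_mask are already built:

    outputs = model(input_ids, attention_mask=attention_mask)
    last_hidden = outputs.last_hidden_state  # a tensor, not a string

    # or restore the old v3-style tuple return:
    last_hidden, pooled = model(input_ids, attention_mask=attention_mask,
                                return_dict=False)
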
15 votes, 2 answers

How to encode multiple sentences using transformers.BertTokenizer?

I would like to create a minibatch by encoding multiple sentences using transformers.BertTokenizer. It seems to work for a single sentence. How do I make it work for several sentences? from transformers import BertTokenizer tokenizer =…
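
Calling the tokenizer on a list of strings builds the minibatch directly; a minimal sketch:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    sentences = ["First sentence.", "And a second, longer sentence."]

    # padding makes the batch rectangular; return_tensors gives one tensor
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    print(batch["input_ids"].shape)  # (2, longest_sequence_in_batch)
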
13 votes, 2 answers

Download pre-trained sentence-transformers model locally

I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) for creating embeddings of sentences using the pre-trained model bert-base-nli-mean-tokens. I have an application that will be…
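
SentenceTransformer objects offer a save()/reload round trip for exactly this; a sketch (the directory name is a placeholder):

    from sentence_transformers import SentenceTransformer

    # Download once...
    model = SentenceTransformer("bert-base-nli-mean-tokens")
    model.save("./local_model")  # ...then persist to disk

    # Later, fully offline:
    model = SentenceTransformer("./local_model")
    embeddings = model.encode(["some sentence"])
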
10 votes, 3 answers

HuggingFace AutoModelForCausalLM "decoder-only architecture" warning, even after setting padding_side='left'

I'm using AutoModelForCausalLM and AutoTokenizer to generate text output with DialoGPT. For whatever reason, even when using the provided examples from huggingface I get this warning: A decoder-only architecture is being used, but right-padding was…
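
A sketch of the usual left-padding setup, plus the probable reason the warning persists with DialoGPT: pad_token is aliased to eos_token, and DialoGPT inputs legitimately end with eos, which generate() misreads as right padding:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium",
                                              padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    ids = tokenizer("Hello, how are you?" + tokenizer.eos_token,
                    return_tensors="pt")
    # Setting pad_token_id explicitly also quiets the related pad warning
    reply = model.generate(**ids, pad_token_id=tokenizer.eos_token_id,
                           max_new_tokens=50)
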
10 votes, 3 answers

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I'm trying to get sentiments for comments with the help of a Hugging Face sentiment-analysis pretrained model. It returns an error like: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512)…
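
Encoding with truncation enabled keeps every comment inside the model's 512-token window; a sketch using the default sentiment model as an assumed example:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    # truncation + max_length prevent the 651 > 512 overflow
    inputs = tokenizer(comment, truncation=True, max_length=512,
                       return_tensors="pt")  # comment is a placeholder variable
    with torch.no_grad():
        logits = model(**inputs).logits
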
8 votes, 4 answers

Facing SSL Error with Huggingface pretrained models

I am facing the issue below while loading a pretrained model from Hugging Face. HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /roberta-base/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1,…
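
Answers usually trace this to a corporate proxy intercepting TLS; pointing requests at the proxy's CA bundle is the non-destructive fix. A sketch, with a placeholder certificate path:

    import os
    # Path below is a placeholder for your organization's CA certificate
    os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/corporate-ca.pem"

    from transformers import AutoModel
    model = AutoModel.from_pretrained("roberta-base")
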
8 votes, 1 answer

What is so special about special tokens?

What exactly is the difference between a "token" and a "special token"? I understand the following: what a typical token is; what a typical special token is (MASK, UNK, SEP, etc.); when you add a token (when you want to expand your vocab). What I…
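
Two properties distinguish special tokens in practice: the tokenizer never splits them, and decode can drop them wholesale; a sketch with a made-up <CTX> token:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_special_tokens({"additional_special_tokens": ["<CTX>"]})

    ids = tokenizer.encode("<CTX> hello")  # <CTX> stays one token, never split
    print(tokenizer.decode(ids, skip_special_tokens=True))  # "hello" -- specials dropped
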