
I'm trying to use the text-classification pipeline from Hugging Face transformers to perform sentiment analysis, but some of my texts exceed the model's limit of 512 tokens. I want the pipeline to truncate the excess tokens automatically. I tried the approach from this thread, but it did not work.

Here is my code:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

nlp = pipeline('sentiment-analysis',
               model=AutoModelForSequenceClassification.from_pretrained(
                   "model",
                   return_dict=False),
               tokenizer=AutoTokenizer.from_pretrained(
                   "model",
                   return_dict=False),
               framework="pt", return_all_scores=False)

output = nlp(article)

Dalireeza

2 Answers


In case anyone faces the same issue, here is how I solved it:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model", return_dict=False)

nlp = pipeline('sentiment-analysis',
               model=AutoModelForSequenceClassification.from_pretrained(
                   "model",
                   return_dict=False),
               tokenizer=tokenizer,
               framework="pt", return_all_scores=False)

# Encode with truncation, then decode back to text, so the pipeline
# re-tokenizes an input that is guaranteed to fit within 512 tokens.
encoded_input = tokenizer(article, truncation=True, max_length=512)
decoded_input = tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True)
output = nlp(decoded_input)
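To illustrate why this encode-truncate-decode round-trip works, here is a minimal sketch in plain Python (using whitespace "tokens" instead of a real subword tokenizer, so it does not depend on transformers): truncating at encode time and decoding back yields a text that re-tokenizes within the limit.

```python
# Toy illustration of the encode -> truncate -> decode round-trip above.
# Whitespace splitting stands in for a real subword tokenizer.
MAX_LENGTH = 512

def encode(text, truncation=False, max_length=None):
    tokens = text.split()
    if truncation and max_length is not None:
        tokens = tokens[:max_length]  # drop everything past the limit
    return tokens

def decode(tokens):
    return " ".join(tokens)

article = "word " * 600  # 600 tokens, over the limit
truncated_text = decode(encode(article, truncation=True, max_length=MAX_LENGTH))
# The decoded text now re-encodes to at most MAX_LENGTH tokens,
# so a downstream pipeline no longer fails on it.
assert len(encode(truncated_text)) <= MAX_LENGTH
```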
Dalireeza

Alternatively, and more directly, you can simply pass those parameters as **kwargs when calling the pipeline:

from transformers import pipeline

nlp = pipeline("sentiment-analysis")
nlp(long_input, truncation=True, max_length=512)
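As a rough sketch of why this works (plain Python, not the actual transformers internals; all names here are illustrative): the pipeline's __call__ forwards extra keyword arguments on to its tokenizer, so truncation and max_length reach the tokenization step.

```python
# Toy model of a pipeline forwarding call-time kwargs to its tokenizer.
# ToyTokenizer/ToyPipeline are hypothetical, not real transformers classes.
class ToyTokenizer:
    def __call__(self, text, truncation=False, max_length=None):
        tokens = text.split()
        if truncation and max_length is not None:
            tokens = tokens[:max_length]
        return tokens

class ToyPipeline:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, text, **tokenizer_kwargs):
        # Extra kwargs are passed straight through to the tokenizer.
        return self.tokenizer(text, **tokenizer_kwargs)

nlp = ToyPipeline(ToyTokenizer())
# 600 whitespace tokens in, at most 512 out.
assert len(nlp("word " * 600, truncation=True, max_length=512)) == 512
```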
dennlinger
  • I realize this has also been suggested as an answer in the other thread; if it doesn't work, please specify *why* it does not work in your case, i.e., what the error message is. – dennlinger Mar 04 '22 at 09:49
  • Using this approach did not work; the text was not truncated to 512 tokens. I read somewhere that when a pre-trained model is used, the arguments I pass (truncation, max_length) won't work. Maybe that's the case; I'm not sure. – Dalireeza Mar 05 '22 at 17:46
  • How can you tell that the text was not truncated? – dennlinger Mar 06 '22 at 10:49
  • Well, because `nlp(long_input, truncation=True, max_length=512)` throws the same error as `nlp(long_input)` --- tensor a is > tensor b – Dalireeza Mar 06 '22 at 19:54