
I'm trying to get sentiment scores for comments using a pretrained Hugging Face sentiment-analysis model. It returns the error Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) from the Hugging Face sentiment classifier.

Below is my code; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import pandas as pd

# Load the fine-tuned model and its tokenizer from a local directory
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

# Build a sentiment-analysis pipeline from the loaded model and tokenizer
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

# Read the reviews dataset
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

Output is

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by code

# Keep just the sentiment labels from the pipeline output
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment

Output is

['POSITIVE']

Appending every review to an empty list:

text = []

for index, row in data.iterrows():
    text.append(row['Review'])

Now I'm trying to get the sentiment for all the rows:

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
– Nithin Reddy
  • If you just want to disable that warning, use `transformers.utils.logging.set_verbosity_error()` – Ritwik Jul 09 '23 at 19:49
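
For completeness, a runnable version of that suppression. Note it only hides the warning; the model's 512-token limit still applies:

import transformers

# Silence transformers' warnings, including the sequence-length one
transformers.utils.logging.set_verbosity_error()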

3 Answers


Some of the sentences in your Review column are too long. When they are converted to tokens and sent into the model, they exceed its 512-token sequence limit: the embedding layer of the model used for the sentiment-analysis task was trained with only 512 positions.

To fix this, you can filter out the long sentences and keep only the smaller ones (with token length < 512).
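
A minimal sketch of that filter, reusing the token tokenizer and data frame from the question (the 512 limit includes the special tokens the tokenizer adds):

# Keep only the reviews whose tokenized length fits the model
keep = data['Review'].apply(lambda t: len(token(t)['input_ids']) <= 512)
short_data = data[keep]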

Or you can truncate the sentences by passing truncation=True:

sentiment = classifier(data.iloc[i,0], truncation=True)
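
The same flag also works if you classify the whole column in one call, since pipelines accept a list of strings. A sketch based on the classifier and data from the question:

# truncation=True clips each review to the model's limit before inference
sent = classifier(data['Review'].tolist(), truncation=True)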
– kiranr
  • What if you don't want to truncate the input? What is the canonical procedure to split the input into 512 tokens input in a pipeline? – Florin Andrei Apr 17 '23 at 20:04

If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).

In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).

token = AutoTokenizer.from_pretrained("your model")
# Truncate while tokenizing so no sequence exceeds MAX_TOKENS
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)

Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.
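
If you go this separate-tokenization route, one way to run the classification yourself is to encode (rather than just tokenize) with truncation and feed the tensors to the model directly. A sketch, assuming the model and token objects from the question and PyTorch installed:

import torch

# Encode with truncation so inputs never exceed the model's limit
inputs = token(text, max_length=512, truncation=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)  # per-class probabilities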

– Eric McLachlan
  • What if you don't want to truncate the input? What is the canonical procedure to split the input into 512 tokens input in a pipeline? – Florin Andrei Apr 17 '23 at 20:04
  • I don't feel qualified to answer canonically but my advice would be to use an RNN-type neural architecture that will allow you to segment your input into a sequence of smaller chunks (paragraphs or sentences) that is more likely to fit within the 510-token limit. – Eric McLachlan Apr 19 '23 at 17:33
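
As an illustration of that chunking idea: fast tokenizers can split a long text into overlapping windows with return_overflowing_tokens, after which each window is classified and the scores combined. A sketch (long_text is a placeholder, and averaging the window probabilities is just one possible aggregation rule, not a canonical one):

import torch

# Split one long review into overlapping windows of at most 512 tokens
enc = token(long_text, max_length=512, truncation=True, stride=64,
            padding=True, return_overflowing_tokens=True, return_tensors='pt')
with torch.no_grad():
    logits = model(input_ids=enc['input_ids'],
                   attention_mask=enc['attention_mask']).logits
# Average the per-window probabilities into one score for the review
probs = logits.softmax(dim=-1).mean(dim=0)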

For those interested in suppressing the warning altogether (it's a valid use case), try:

import sys

from transformers import AutoTokenizer, PreTrainedTokenizer

tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
# Raise the tokenizer's advertised limit so the warning never fires;
# the model itself still cannot take more tokens than it was trained on.
tokenizer.model_max_length = sys.maxsize
– Jose Solorzano