I want to use raw BERT without fine-tuning for sentiment analysis, but there is a limit on the number of tokens per review. There are some related questions on the same issue, but they are about TensorFlow itself, not the BERT model, so those solutions won't work for me. Adding max_len = 5000 to the tokenizer also doesn't work, because it gets overwritten by the model's config file (someone said this on a question that doesn't have an answer yet; here is the question link).
My current workaround is to take the first 512 tokens and truncate the rest, which may affect the overall sentiment, so I want to remove it. Here is my code with the truncation:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

def predict_sentiment(tweet):
    # [:512] slices the first 512 characters (not tokens) so the input fits the model
    tokens = tokenizer.encode(tweet[:512], return_tensors='pt')
    result = model(tokens)
    # the model predicts 1-5 stars; argmax is 0-based, so add 1
    res = int(torch.argmax(result.logits)) + 1
    # collapse the 5 star ratings into negative (1), neutral (2), positive (3)
    if res == 1 or res == 2:  # was `res == 1 | res == 2`, which parses as `res == (1 | res) == 2`
        res = 1               # was `res == 1`, a comparison that assigns nothing
    elif res == 3:
        res = 2
    else:
        res = 3
    return res

print(predict_sentiment("hello world"))
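(Side note: I realize tweet[:512] slices the first 512 characters rather than tokens; token-level truncation would be something like the line below, but truncation is exactly what I'm trying to avoid.)

tokens = tokenizer.encode(tweet, truncation=True, max_length=512, return_tensors='pt')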
And this is the error stack trace when I remove the [:512]:
Token indices sequence length is longer than the specified maximum sequence length for this model (611 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\Model", line 24, in <module>
print(predict_sentiment(tweets['Text'][82]))
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\Model", line 11, in predict_sentiment
result = model(tokens)
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\venv\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\venv\lib\site-packages\transformers\models\bert\modeling_bert.py", line 1556, in forward
outputs = self.bert(
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\venv\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Ahmed\PycharmProjects\SentimentModel\venv\lib\site-packages\transformers\models\bert\modeling_bert.py", line 984, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (611) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 611]. Tensor sizes: [1, 512]
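The only workaround I can think of is a chunking approach: encode the whole review without truncation, split the token ids into windows the model can handle, run the model on each window, and combine the results. Here is a rough, untested sketch of what I mean; averaging the per-chunk logits is just my assumption about how to combine the windows, and predict_sentiment_chunked is a name I made up:

def predict_sentiment_chunked(tweet):
    # encode without special tokens so [CLS]/[SEP] can be added per chunk
    input_ids = tokenizer.encode(tweet, add_special_tokens=False)
    chunk_size = 510  # 510 content tokens + [CLS] + [SEP] = 512
    logits_sum, n_chunks = None, 0
    for i in range(0, len(input_ids), chunk_size):
        chunk = [tokenizer.cls_token_id] + input_ids[i:i + chunk_size] + [tokenizer.sep_token_id]
        with torch.no_grad():
            logits = model(torch.tensor([chunk])).logits
        # average the logits over all chunks (my assumption, not a known best practice)
        logits_sum = logits if logits_sum is None else logits_sum + logits
        n_chunks += 1
    res = int(torch.argmax(logits_sum / n_chunks)) + 1  # 1-5 stars
    return 1 if res <= 2 else (2 if res == 3 else 3)

Is something like this a reasonable way around the 512-token limit without fine-tuning, or is there a better approach?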