
So I was trying out EmoRoBERTa for emotion classification; however, some of the strings in my data exceed the 512-token limit. Is there any way to increase this limit? I read somewhere about setting `max_length = 1024`, but I am not sure whether this works.

I am using the `transformers` library like this:

from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

# Load the pretrained EmoRoBERTa tokenizer and TensorFlow model
tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")
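
For context, I can avoid the crash by truncating at tokenization time, but that discards everything past the 512th token, which is why I was hoping to raise the limit instead (a minimal sketch, where `long_text` stands in for one of my over-long strings):

# Truncation keeps the model call from crashing, at the cost of ignoring
# all tokens beyond the 512-token window.
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="tf")
outputs = model(**inputs)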
Shrumo
    Does this answer your question? [How to use Bert for long text classification?](https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification) – Jindřich Oct 29 '21 at 08:24
  • @Jindřich I did see that post; however, I am unable to modify the parameters in this particular model and was looking for a way to do that. – Shrumo Oct 29 '21 at 10:46
  • Can you clarify what you mean by "you are unable to modify parameters in this model"? Splitting the samples and averaging over them should work without modifications to the model itself. – dennlinger Oct 29 '21 at 11:01
  • @dennlinger Sorry, I am new to Python, but by splitting the samples, do you mean truncating? Also, to clarify my question: by not being able to modify the parameters, I meant that I tried setting `max_length = 1024` instead of the default 512, and just wanted to check whether that is even possible for this specific model. – Shrumo Oct 29 '21 at 12:31
  • The problem is that setting the `max_length` option would require you to train the model from scratch, which is likely not a valid option for your use case. By splitting, I mean simply separating longer samples into multiple samples (e.g., by splitting on sentence boundaries), which allows you to aggregate predictions across these sub-samples. – dennlinger Oct 29 '21 at 12:41
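
A minimal sketch of the split-and-aggregate approach dennlinger describes, assuming the `tokenizer` and `model` loaded above (the helper name `classify_long_text` and the token-level chunking are illustrative; splitting on sentence boundaries, as suggested, may work better):

import tensorflow as tf

def classify_long_text(text, chunk_size=510):  # 510 leaves room for <s> and </s>
    # Tokenize without special tokens so the ids can be sliced freely,
    # then re-wrap each chunk with the special tokens RoBERTa expects.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    probs = []
    for chunk in chunks:
        chunk_ids = tokenizer.build_inputs_with_special_tokens(chunk)
        logits = model(input_ids=tf.constant([chunk_ids])).logits
        probs.append(tf.nn.softmax(logits, axis=-1))

    # Average the per-chunk probability distributions and pick the top label.
    mean_probs = tf.reduce_mean(tf.concat(probs, axis=0), axis=0)
    return model.config.id2label[int(tf.argmax(mean_probs))]

Averaging is only one aggregation choice; taking the per-label maximum over chunks is a common alternative when a single strongly emotional chunk should dominate the prediction.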

0 Answers