I want to use the BERT model implemented in the Hugging Face transformers library, and I use a pipeline to do the job. Here is a minimal example of how I do it:
from transformers import BertForMaskedLM, BertTokenizer, pipeline
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
fill_sentence = pipeline('fill-mask', model=model, tokenizer=tokenizer)
long_sentence = ['This is very [MASK] sentence. This is very very long sentence.' * 40]
fill_sentence(long_sentence)
The problem is that I get the following error for very long sentences:
RuntimeError: The size of tensor a (522) must match the size of tensor b (512) at non-singleton dimension 1
I know about the model's maximum input length and that I have to truncate long sentences. I'm familiar with the concept, but I can't figure out how to truncate the input data when using the pipeline.
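(For reference, this is how I checked the limit; I believe model_max_length is the relevant tokenizer attribute here:)

print(tokenizer.model_max_length)  # prints 512 for bert-base-uncased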
PS1: First I tried to create a tokenizer with truncation set to True, but apparently truncation is determined when calling the tokenizer, not when initializing it. I know the call below will truncate the input:
tokenizer(long_sentence, truncation=True, max_length=512)
but I don't want to work directly with the model and tokenizer. My goal is to keep using the pipeline, but I'm not sure how to pass tokenizer call arguments through it.
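For example, I was hoping something like the sketch below would work. I'm only guessing at the tokenizer_kwargs argument name here, and I haven't been able to confirm that my version of transformers supports it:

# guess: forward truncation options to the tokenizer through the pipeline call
fill_sentence(long_sentence, tokenizer_kwargs={'truncation': True, 'max_length': 512})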
PS2: I have seen the following StackOverflow question and its answers, but they don't work for me. I don't know whether it's because of an old version or something I'm doing wrong.
I have been stuck on this problem for a week now; I have read the Hugging Face documentation and searched the web with no luck, so I would really appreciate any help. Thanks!