
I am trying to use bert-large-uncased for long sequence encoding, but it's giving an error.

Code:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained("bert-large-uncased")

text = "Replace me by any text you'd like."*1024
encoded_input = tokenizer(text, truncation=True, max_length=1024, return_tensors='pt')
output = model(**encoded_input)

It's giving the following error:

~/.local/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    218         if self.position_embedding_type == "absolute":
    219             position_embeddings = self.position_embeddings(position_ids)
--> 220             embeddings += position_embeddings
    221         embeddings = self.LayerNorm(embeddings)
    222         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 1

I also tried to change the default size of the positional embedding:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained("bert-large-uncased")
model.config.max_position_embeddings = 1024
text = "Replace me by any text you'd like."*1024
encoded_input = tokenizer(text, truncation=True, max_length=1024, return_tensors='pt')
output = model(**encoded_input)

But the error persists. How can I use the large model for 1024-token sequences?

Aaditya Ura

1 Answer


I might be wrong, but I think you already have your answers here: How to use Bert for long text classification?

Basically you will need some kind of truncation on your text, or you will need to handle it in chunks and stitch the results back together.
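For illustration, here is a minimal sketch of the chunking approach. The mean-pooling of per-chunk [CLS] vectors is just one possible way to combine the pieces, not something prescribed by the linked answer:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained('bert-large-uncased')

text = "Replace me by any text you'd like." * 1024

# Tokenize without special tokens, then split into windows that fit into 512.
ids = tokenizer(text, add_special_tokens=False)['input_ids']
chunk_size = 510  # leave room for [CLS] and [SEP]
chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

cls_embeddings = []
with torch.no_grad():
    for chunk in chunks:
        input_ids = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        output = model(input_ids=input_ids)
        cls_embeddings.append(output.last_hidden_state[:, 0])  # [CLS] vector per chunk

# "Stitch them back together": here simply by averaging the chunk embeddings.
document_embedding = torch.stack(cls_embeddings).mean(dim=0)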

Side note: the large model is not called large because of the sequence length. The max sequence length is still 512 tokens (tokens from your tokenizer, not words in your sentence).

EDIT:

The pretrained model you would like to use was trained on a maximum of 512 tokens. When you download it from huggingface, you can see max_position_embeddings in the configuration, which is 512. That means you cannot really extend it. (Well, strictly speaking, that is not entirely true.)
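You can check this limit directly in the checkpoint's configuration:

from transformers import BertConfig

config = BertConfig.from_pretrained('bert-large-uncased')
print(config.max_position_embeddings)  # 512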


However, you can always tweak the configuration:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
# Override max_position_embeddings and let from_pretrained skip the size check;
# the position-embedding weights that no longer match are newly initialized.
model = BertModel.from_pretrained(
    'bert-large-uncased',
    max_position_embeddings=1024,
    ignore_mismatched_sizes=True
)
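If you go down this road, the resized model will at least accept longer inputs. A rough sketch of what running it could look like (the printed shape assumes bert-large's hidden size of 1024 and an input that fills the full 1024 positions):

text = "Replace me by any text you'd like." * 1024
encoded_input = tokenizer(text, truncation=True, max_length=1024, return_tensors='pt')
output = model(**encoded_input)
print(output.last_hidden_state.shape)  # torch.Size([1, 1024, 1024])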

Note that this is very ill-advised since it will ruin your pretrained model. Maybe it will go rogue, planets will start to collide, or pigs will start to fall out of the sky. No one can really tell. Use it at your own risk.

skyzip
  • @AadityaUra That means you want to know how to extend `bert-large-uncased` to 1024 position_embeddings even if the added embeddings are untrained and ruin the model performance? – cronoik Aug 06 '22 at 13:31
  • I modified my answer. But I really think you would be better off training a whole other `BertModel` with your configuration. – skyzip Aug 06 '22 at 14:15