I am working on a binary classification task and would like to try adding an LSTM layer on top of the last hidden layer of the Hugging Face BERT model; however, I couldn't reach the last hidden layer. Is it possible to combine BERT with an LSTM?

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained(model_path)
train_inputs, train_labels, train_masks = data_prepare_BERT(
    train_file, lab2ind, tokenizer, content_col, label_col,
    max_seq_length)
validation_inputs, validation_labels, validation_masks = data_prepare_BERT(
    dev_file, lab2ind, tokenizer, content_col, label_col, max_seq_length)

# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    model_path, num_labels=len(lab2ind))

1 Answer

Indeed it is possible, but you need to implement it yourself. The BertForSequenceClassification class is a wrapper around BertModel: it runs the model, takes the hidden state corresponding to the [CLS] token, and applies a classifier on top of that.

In your case, you can use that class as a starting point and add an LSTM layer between the BertModel and the classifier. BertModel returns a tuple containing both the per-token hidden states and a pooled state for classification. Just take the tuple member other than the one used in the original class.
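
For illustration, here is a minimal sketch of such a model, assuming PyTorch and the transformers library; the class name BertLSTMClassifier, the lstm_hidden size, and feeding the last LSTM time step into the classifier are illustrative choices, not part of the original question or answer.

import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTMClassifier(nn.Module):
    def __init__(self, model_path, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_path)
        # Run an LSTM over the per-token hidden states returned by BertModel
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # outputs[0] holds the per-token hidden states (batch, seq_len, hidden_size);
        # outputs[1] is the pooled state that BertForSequenceClassification uses.
        sequence_output = outputs[0]
        lstm_out, _ = self.lstm(sequence_output)
        # Classify from the last time step of the (bidirectional) LSTM output
        logits = self.classifier(lstm_out[:, -1, :])
        return logits

Such a model could then be trained with the loop you already have, replacing the BertForSequenceClassification line with model = BertLSTMClassifier(model_path, num_labels=len(lab2ind)) and computing the loss from the returned logits.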

Although it is technically possible, I would not expect any performance gain compared to using BertForSequenceClassification. Fine-tuning the Transformer layers can learn anything that an additional LSTM layer is capable of.

  • Many thanks for your reply! Does adding an LSTM to the [BertForSequenceClassification](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L1449) class require additional computational cost (I'm using a Colab GPU)? I mean, if I add it to the original class, would I need to re-train the whole model? – Seeker Jan 18 '21 at 13:17
  • Also, I would like to try adding BiLSTM and CNN-LSTM. Do you think this will improve performance? – Seeker Jan 18 '21 at 13:31
  • What do you mean by re-training the whole model? The BERT model is pre-trained; you can fine-tune it or not. Anything you add on top of BERT needs to be trained from scratch, no matter whether it is a simple classifier or an LSTM. – Jindřich Jan 19 '21 at 09:16
  • As for other architectures: it is possible that your particular problem is very well suited to an LSTM or a CNN and you would get significantly better performance, but I would not expect much compared to fine-tuning BERT. – Jindřich Jan 19 '21 at 09:18
  • Very useful! Thank you so much, @Jindřich. I thought that if I added an additional layer on top of the last hidden layer of BERT, there would be a way to get the hidden states of that layer and feed them into a new layer without training BERT from scratch. – Seeker Jan 19 '21 at 09:27