I am working on a binary classification task and would like to try adding an LSTM layer on top of the last hidden layer of the Hugging Face BERT model; however, I couldn't reach the last hidden layer. Is it possible to combine BERT with an LSTM?

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained(model_path)
train_inputs, train_labels, train_masks = data_prepare_BERT(
    train_file, lab2ind, tokenizer, content_col, label_col,
    max_seq_length)
validation_inputs, validation_labels, validation_masks = data_prepare_BERT(
    dev_file, lab2ind, tokenizer, content_col, label_col, max_seq_length)

# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    model_path, num_labels=len(lab2ind))

1 Answer

Indeed it is possible, but you need to implement it yourself. The BertForSequenceClassification class is a wrapper around BertModel: it runs the model, takes the hidden state corresponding to the [CLS] token, and applies a classifier on top of that.

In your case, you can use that class as a starting point and add an LSTM layer between the BertModel and the classifier. BertModel returns a tuple containing both the per-token hidden states and a pooled state for classification. Just take the tuple member other than the one used in the original class.
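
For illustration, here is a minimal sketch of such a model, assuming PyTorch and the transformers library; the class name BertLSTMClassifier, the lstm_hidden size, and feeding the last LSTM time step into the classifier are illustrative choices, not part of the original question or answer.

import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTMClassifier(nn.Module):
    def __init__(self, model_path, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_path)
        # Run an LSTM over the per-token hidden states returned by BertModel
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # outputs[0] holds the per-token hidden states (batch, seq_len, hidden_size);
        # outputs[1] is the pooled state that BertForSequenceClassification uses.
        sequence_output = outputs[0]
        lstm_out, _ = self.lstm(sequence_output)
        # Classify from the last time step of the (bidirectional) LSTM output
        logits = self.classifier(lstm_out[:, -1, :])
        return logits

Such a model could then be trained with the loop you already have, replacing the BertForSequenceClassification line with model = BertLSTMClassifier(model_path, num_labels=len(lab2ind)) and computing the loss from the returned logits.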

Although it is technically possible, I would not expect any performance gain compared to using BertForSequenceClassification. Fine-tuning the Transformer layers can learn anything that an additional LSTM layer is capable of.

  • Many thanks for your reply! Does adding an LSTM to the [BertForSequenceClassification](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L1449) class require additional computational cost (I'm using a Colab GPU)? I mean, if I add it to the original class, would I need to re-train the whole model? – Seeker Jan 18 '21 at 13:17
  • Also, I would like to try adding BiLSTM and CNN-LSTM. Do you think this will improve performance? – Seeker Jan 18 '21 at 13:31
  • What do you mean by re-training the whole model? The BERT model is pre-trained; you can fine-tune it or not. Anything you add on top of BERT needs to be trained from scratch, no matter whether it is a simple classifier or an LSTM. – Jindřich Jan 19 '21 at 09:16
  • As for other architectures: it is possible that your particular problem is very well suited to an LSTM or a CNN and you would get significantly better performance, but I would not expect much compared to fine-tuning BERT. – Jindřich Jan 19 '21 at 09:18
  • Very useful! Thank you so much, @Jindřich. I thought that if I added an additional layer on top of the last hidden layer of BERT, there would be a way to get the hidden states of that layer and feed them into a new layer without training BERT from scratch. – Seeker Jan 19 '21 at 09:27