
I'm using PyTorch and the base pretrained BERT model to classify sentences for hate speech. I want to implement a Bi-LSTM layer that takes all the outputs of the last transformer encoder of the BERT model as its input, inside a new model (a class that implements nn.Module), and I got confused with the nn.LSTM parameters. I loaded the pretrained model using

bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=int(data['class'].nunique()), output_attentions=False, output_hidden_states=False)

My dataset has two columns: class (the label) and sentence. Can someone help me with this? Thank you in advance.

Edit: Also, after processing the input in the Bi-LSTM, the network should send the final hidden state to a fully connected network that performs classification using the softmax activation function. How can I do that?

Alaa Grable

1 Answer


You can do it as follows:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        ### New layers:
        self.lstm = nn.LSTM(768, 256, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(256*2, <number_of_classes>)

    def forward(self, ids, mask):
        # sequence_output contains the outputs of the last transformer encoder layer
        # and has the following shape: (batch_size, sequence_length, 768)
        # (recent versions of transformers return a model output object rather than a tuple)
        sequence_output = self.bert(ids, attention_mask=mask).last_hidden_state

        # run the Bi-LSTM over the whole sequence of token embeddings
        lstm_output, (h, c) = self.lstm(sequence_output)  # (batch_size, sequence_length, 256*2)

        # final hidden state: forward direction at the last time step,
        # concatenated with the backward direction at the first time step
        hidden = torch.cat((lstm_output[:, -1, :256], lstm_output[:, 0, 256:]), dim=-1)

        ### only the final Bi-LSTM hidden state is used to perform classification
        linear_output = self.linear(hidden)

        return linear_output

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = CustomBERTModel()
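
As a minimal usage sketch (not part of the original answer; the sentences and labels below are purely illustrative, and `<number_of_classes>` above must first be replaced with the actual number of labels, e.g. 3 as mentioned in the comments), the model can be trained with nn.CrossEntropyLoss, which applies log-softmax to the logits internally:

sentences = ["first example sentence", "second example sentence"]
labels = torch.tensor([0, 2])

encoding = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
logits = model(encoding["input_ids"], encoding["attention_mask"])  # (batch_size, number_of_classes)

# nn.CrossEntropyLoss applies log-softmax internally, so it is fed the raw logits
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()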

Ashwin Geet D'Sa
  • I have a few questions: (1) How is this a Bi-LSTM layer? (2) How and where are you taking all the outputs of the last transformer encoder from the BERT model? – Alaa Grable Dec 09 '20 at 13:45
  • I had initially given the answer for LSTM. I have updated the answer according to Bi-LSTM – Ashwin Geet D'Sa Dec 09 '20 at 14:15
  • Answer for 2: I have already added the comments on `sequence_output` variable. – Ashwin Geet D'Sa Dec 09 '20 at 14:23
  • Thank you so much. So the output containing all the outputs of the last transformer encoder will be in sequence_output? I also have another question: where do I specify the number of classes? I'm trying to classify the sentences into 3 labels, and I saw examples that do `BertModel.from_pretrained("bert-base-uncased", num_labels = 2)`, or do you not have to do that because you add the number of labels in nn.Linear? Also, why is the input for nn.Linear 256*2? @Ashwin Geet D'Sa – Alaa Grable Dec 09 '20 at 15:29
  • `self.linear = nn.Linear(256*2, <number_of_classes>)`; specify the number of classes there. – Ashwin Geet D'Sa Dec 09 '20 at 16:25
  • Thank you, and one last question. You said in a comment in the code: "only the final Bi-LSTM hidden state is used to perform classification". What I want is that, after processing the input in the Bi-LSTM, the network sends the final hidden state to a fully connected network that performs classification using the softmax activation function. How can I do that? @Ashwin Geet D'Sa – Alaa Grable Dec 10 '20 at 06:04
  • Isn't the code already doing the operation of the FC layer? You just have to add softmax after the linear layer. Moreover, this is not the place to ask sequential questions. That's why the guidelines ask you to mention what you have tried in the question itself, which you haven't. – Ashwin Geet D'Sa Dec 10 '20 at 08:47
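
To make that last comment concrete (a small sketch, not part of the original thread): the softmax is only needed when you want class probabilities at prediction time, since nn.CrossEntropyLoss already applies log-softmax to the raw logits during training.

import torch.nn.functional as F

# at prediction time, turn the linear layer's logits (linear_output above) into class probabilities
probabilities = F.softmax(linear_output, dim=-1)   # shape: (batch_size, number_of_classes)
predicted_class = probabilities.argmax(dim=-1)     # index of the most likely label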