I am currently building a named entity recognition (NER) system for the Moroccan dialect using a BERT + BiGRU + softmax architecture. The model is overfitting: F1 is around 98% on the training set but only around 78% on the validation set. My dataset is admittedly small (65,905 tokens). Can you please guide me on reducing the overfitting and improving validation performance, preferably without augmenting the data?
I have tried multiple regularization techniques, but they either did not reduce overfitting or reduced it at the expense of overall performance. For example, I set BERT's hidden_dropout_prob to 0.3, added a dropout layer with p = 0.1 between BERT and the BiGRU, added another dropout layer with p = 0.4 between the BiGRU and the dense layer, and used a weight decay of 0.01. Additionally, I made the network larger by increasing the BiGRU hidden size to 512 so that dropout would not starve the model of capacity. This did reduce overfitting, but performance dropped on both sets (74% validation F1 and 83% training F1). Here is my PyTorch code; I am using Weights & Biases Sweeps for hyperparameter tuning (a simplified version of the sweep setup follows the model code):
import torch.nn as nn
import torch.nn.functional as F
import transformers

class NER_GRU_Model(nn.Module):
    def __init__(self, hidden_size, num_layers, bert_dropout, p1, p2, num_classes=len(tag2idx)):
        super().__init__()
        # BERT encoder with configurable hidden dropout; no pooling layer needed for token classification
        config = transformers.BertConfig.from_pretrained(lm, hidden_dropout_prob=bert_dropout)
        self.bert = transformers.BertModel.from_pretrained(lm, config=config, add_pooling_layer=False)
        input_size = self.bert.config.hidden_size
        # Bidirectional GRU over the BERT token representations
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)
        self.drop1 = nn.Dropout(p1)  # between BERT and the BiGRU
        self.drop2 = nn.Dropout(p2)  # between the BiGRU and the dense layer
        # Optionally freeze the BERT encoder (lm, tag2idx and freeze_bert are defined globally)
        for param in self.bert.parameters():
            param.requires_grad = not freeze_bert

    def forward(self, input_ids, attention_mask=None):
        s = self.bert(input_ids=input_ids, attention_mask=attention_mask)['last_hidden_state']
        s = self.drop1(s)
        s, _ = self.gru(s)
        s = self.drop2(s)
        s = self.fc(s)
        return F.log_softmax(s, dim=-1)
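Roughly, my sweep setup looks like the sketch below; the search ranges, the project name, and the training-loop details are simplified placeholders rather than my exact configuration. The optimizer and loss shown here (AdamW with the swept weight decay, and NLLLoss to match the log-softmax output) are just to illustrate where the weight decay is applied:

import torch
import wandb

# Simplified sweep over the regularization hyperparameters discussed above;
# the ranges and project name are placeholders.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_f1", "goal": "maximize"},
    "parameters": {
        "bert_dropout": {"values": [0.1, 0.2, 0.3]},
        "p1":           {"values": [0.1, 0.2, 0.3]},
        "p2":           {"values": [0.2, 0.3, 0.4]},
        "hidden_size":  {"values": [256, 512]},
        "num_layers":   {"values": [1, 2]},
        "weight_decay": {"values": [0.0, 0.01]},
        "lr":           {"values": [1e-5, 3e-5, 5e-5]},
    },
}

def train():
    # One sweep run: build the model from the sampled config, train it, and log val_f1.
    with wandb.init() as run:
        cfg = run.config
        model = NER_GRU_Model(hidden_size=cfg.hidden_size, num_layers=cfg.num_layers,
                              bert_dropout=cfg.bert_dropout, p1=cfg.p1, p2=cfg.p2)
        optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr,
                                      weight_decay=cfg.weight_decay)
        criterion = torch.nn.NLLLoss()  # the model returns log-probabilities
        # ... training and validation loop, ending with run.log({"val_f1": ...})

sweep_id = wandb.sweep(sweep_config, project="darija-ner")  # placeholder project name
wandb.agent(sweep_id, function=train, count=20)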
By the way, I also tried a simpler model (BERT + dense layer + softmax), and it suffers from the same level of overfitting; a sketch of that baseline is included below. Your help is much appreciated.
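For completeness, the baseline is roughly the following (the class name and the single dropout value are illustrative placeholders, and it reuses the same lm and tag2idx globals as above):

import torch.nn as nn
import torch.nn.functional as F
import transformers

class NER_Linear_Model(nn.Module):
    # Baseline: BERT encoder followed directly by a dense layer and softmax over the tag set.
    def __init__(self, bert_dropout, p, num_classes=len(tag2idx)):
        super().__init__()
        config = transformers.BertConfig.from_pretrained(lm, hidden_dropout_prob=bert_dropout)
        self.bert = transformers.BertModel.from_pretrained(lm, config=config, add_pooling_layer=False)
        self.drop = nn.Dropout(p)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        s = self.bert(input_ids=input_ids, attention_mask=attention_mask)['last_hidden_state']
        s = self.drop(s)
        return F.log_softmax(self.fc(s), dim=-1)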