At my university's research group we have been pre-training a RoBERTa model for Portuguese, as well as a domain-specific one, also based on RoBERTa. We have been running a series of benchmarks using Hugging Face's transformers library, and the RoBERTa models outperform the existing Portuguese BERT model on almost all datasets and tasks.
One of the tasks we're focusing on is NER, and since AllenNLP supports a CRF-based NER model, we were looking forward to seeing whether we would get even greater improvements by combining these new RoBERTa models with AllenNLP's crf_tagger. We used the same Jsonnet config we had been using for BERT, only switching to RoBERTa, and ran a grid search over some hyperparameters to find the best model. We tested hyperparameters such as weight decay and learning rate (for the huggingface_adamw optimizer) and dropout (for crf_tagger), with 3 different seeds. To our surprise, the RoBERTa models did not get better results than the existing BERT model, which contradicted our experiments with transformers. It wasn't even close: the BERT model was much better (90.43% for the best BERT vs. 89.27% for the best RoBERTa).
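For reference, the model section of our config looks roughly like this (shown as the Python-dict equivalent of the Jsonnet file; the label encoding, dropout value and pass_through encoder are simplified placeholders for our actual settings, not a verbatim copy):

```python
# Rough Python-dict equivalent of the "model" section of our Jsonnet config.
# "BIO", the dropout value and the pass_through encoder are illustrative
# assumptions about a typical crf_tagger + transformer setup. Only
# "model_name" changes between the BERT and RoBERTa runs.
transformer = "roberta-base"  # "bert-base-cased" in the BERT runs

model_section = {
    "type": "crf_tagger",
    "label_encoding": "BIO",
    "dropout": 0.4,  # one of the grid-searched dropout values
    "text_field_embedder": {
        "token_embedders": {
            "tokens": {
                "type": "pretrained_transformer_mismatched",
                "model_name": transformer,
                "max_length": 512,
            }
        }
    },
    # With the mismatched transformer embedder, the contextual encoder can
    # simply pass the 768-d transformer outputs through to the CRF layer.
    "encoder": {"type": "pass_through", "input_dim": 768},
}
```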
This made us suspect that AllenNLP could somehow be biased towards BERT, so we decided to run a standard English benchmark (CoNLL 2003) for NER using both transformers and AllenNLP, and the results reinforced this suspicion. For AllenNLP, we ran a grid search keeping the exact same Jsonnet config, changing only the learning rate (from 8e-6 to 7e-5), the learning rate scheduler (slanted_triangular, and linear_with_warmup with 10% and 3% of the steps as warmup) and, of course, the model (bert-base-cased and roberta-base). The results were surprising: absolutely all models trained with bert-base-cased were better than all roberta-base models (the best BERT reached 91.65% on the test set; the best RoBERTa, 90.63%).
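To make the search space concrete, the grid looked roughly like this (a sketch: the learning rate endpoints and the schedulers are the real ones, while the intermediate learning rates, the step count and the seed values are illustrative; AllenNLP's linear_with_warmup takes warmup in steps, so the 10%/3% ratios are converted with an assumed total step count):

```python
# Sketch of the AllenNLP grid for the CoNLL 2003 runs. Each combination
# corresponds to one `allennlp train` run; the scheduler dicts map onto the
# trainer's "learning_rate_scheduler" section of the Jsonnet config.
from itertools import product

total_steps = 2630  # illustrative: steps_per_epoch * num_epochs in our setup

models = ["bert-base-cased", "roberta-base"]
learning_rates = [8e-6, 1e-5, 3e-5, 5e-5, 7e-5]  # endpoints as stated above
schedulers = [
    {"type": "slanted_triangular"},
    {"type": "linear_with_warmup", "warmup_steps": int(0.10 * total_steps)},
    {"type": "linear_with_warmup", "warmup_steps": int(0.03 * total_steps)},
]
seeds = [13, 42, 1234]  # illustrative seed values

grid = list(product(models, learning_rates, schedulers, seeds))
print(f"{len(grid)} runs")  # 2 * 5 * 3 * 3 = 90
```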
For transformers, we did almost the same thing, except that we didn't change the learning rate scheduler: we kept the default one, which is linear with warmup, with a 10% warmup ratio. We tested the same learning rates and also used 3 different seeds. The transformers results were exactly the opposite: all roberta-base models were better than all bert-base-cased models (the best RoBERTa reached 92.46% on the test set; the best BERT, 91.58%).
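The transformers setup, for reference, looks like this (data loading and tokenization omitted; the batch size, epoch count and output directory are illustrative):

```python
# Minimal sketch of the transformers side (token classification on CoNLL
# 2003). The relevant point is the scheduler: Trainer's default
# lr_scheduler_type is already "linear", so setting warmup_ratio=0.1 gives
# linear with 10% warmup, matching the corresponding AllenNLP runs.
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
    set_seed,
)

model_name = "roberta-base"  # or "bert-base-cased"
set_seed(42)                 # one of the 3 seeds

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # CoNLL 2003 NER: O + B-/I- for PER, LOC, ORG, MISC
)

args = TrainingArguments(
    output_dir="conll03-ner",
    learning_rate=3e-5,  # grid-searched over the same 8e-6 to 7e-5 range
    warmup_ratio=0.1,    # 10% of steps as warmup on the default scheduler
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...)  # tokenized splits
```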
Is there something in the AllenNLP framework that could be making these trained NER models biased towards BERT and underperform with RoBERTa? Where could we start looking for possible issues? It doesn't look like a hyperparameter issue, given how many combinations we have already tested via grid search.
Thanks!