At my university's research group we have been pre-training a RoBERTa model for Portuguese, as well as a domain-specific one, also based on RoBERTa. We have been running a series of benchmarks using Hugging Face's transformers library, and the RoBERTa models outperform the existing Portuguese BERT model on almost all datasets and tasks.
One of the tasks we're focusing on is NER, and since AllenNLP supports a CRF-based NER model, we were looking forward to seeing whether we would get even greater improvements by combining these new RoBERTa models with AllenNLP's crf_tagger. We used the same Jsonnet config we had been using for BERT, only switching to RoBERTa, and ran a grid search over some hyperparameters to find the best model. We tested hyperparameters such as weight decay and learning rate (for the huggingface_adamw optimizer) and dropout (for crf_tagger), with 3 different seeds. To our surprise, the RoBERTa models did not get better results than the existing BERT model, which contradicted our experiments with transformers. It wasn't even close: the BERT model was much better (90.43% for the best BERT vs. 89.27% for the best RoBERTa).
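For reference, the model section of our config looks roughly like this (shown as the Python-dict equivalent of the Jsonnet file; the label encoding, dropout value and pass_through encoder are simplified placeholders for our actual settings, not a verbatim copy):

```python
# Rough Python-dict equivalent of the "model" section of our Jsonnet config.
# "BIO", the dropout value and the pass_through encoder are illustrative
# assumptions about a typical crf_tagger + transformer setup. Only
# "model_name" changes between the BERT and RoBERTa runs.
transformer = "roberta-base"  # "bert-base-cased" in the BERT runs

model_section = {
    "type": "crf_tagger",
    "label_encoding": "BIO",
    "dropout": 0.4,  # one of the grid-searched dropout values
    "text_field_embedder": {
        "token_embedders": {
            "tokens": {
                "type": "pretrained_transformer_mismatched",
                "model_name": transformer,
                "max_length": 512,
            }
        }
    },
    # With the mismatched transformer embedder, the contextual encoder can
    # simply pass the 768-d transformer outputs through to the CRF layer.
    "encoder": {"type": "pass_through", "input_dim": 768},
}
```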
This made us suspect that AllenNLP could somehow be biased towards BERT, so we decided to run a standard English benchmark (CoNLL 2003) for NER using both transformers and AllenNLP, and the results reinforced this suspicion. For AllenNLP, we ran a grid search keeping the exact same Jsonnet config, changing only the learning rate (from 8e-6 to 7e-5), the learning rate scheduler (slanted_triangular, and linear_with_warmup with 10% and 3% of the steps as warmup) and, of course, the model (bert-base-cased and roberta-base). The results were surprising: absolutely all models trained with bert-base-cased were better than all roberta-base models (the best BERT reached 91.65% on the test set; the best RoBERTa, 90.63%).
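To make the search space concrete, the grid looked roughly like this (a sketch: the learning rate endpoints and the schedulers are the real ones, while the intermediate learning rates, the step count and the seed values are illustrative; AllenNLP's linear_with_warmup takes warmup in steps, so the 10%/3% ratios are converted with an assumed total step count):

```python
# Sketch of the AllenNLP grid for the CoNLL 2003 runs. Each combination
# corresponds to one `allennlp train` run; the scheduler dicts map onto the
# trainer's "learning_rate_scheduler" section of the Jsonnet config.
from itertools import product

total_steps = 2630  # illustrative: steps_per_epoch * num_epochs in our setup

models = ["bert-base-cased", "roberta-base"]
learning_rates = [8e-6, 1e-5, 3e-5, 5e-5, 7e-5]  # endpoints as stated above
schedulers = [
    {"type": "slanted_triangular"},
    {"type": "linear_with_warmup", "warmup_steps": int(0.10 * total_steps)},
    {"type": "linear_with_warmup", "warmup_steps": int(0.03 * total_steps)},
]
seeds = [13, 42, 1234]  # illustrative seed values

grid = list(product(models, learning_rates, schedulers, seeds))
print(f"{len(grid)} runs")  # 2 * 5 * 3 * 3 = 90
```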
For transformers, we did almost the same thing, except that we didn't change the learning rate scheduler: we kept the default one, which is linear with warmup, with a 10% warmup ratio. We tested the same learning rates and also used 3 different seeds. The transformers results were exactly the opposite: all roberta-base models were better than all bert-base-cased models (the best RoBERTa reached 92.46% on the test set; the best BERT, 91.58%).
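The transformers setup, for reference, looks like this (data loading and tokenization omitted; the batch size, epoch count and output directory are illustrative):

```python
# Minimal sketch of the transformers side (token classification on CoNLL
# 2003). The relevant point is the scheduler: Trainer's default
# lr_scheduler_type is already "linear", so setting warmup_ratio=0.1 gives
# linear with 10% warmup, matching the corresponding AllenNLP runs.
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
    set_seed,
)

model_name = "roberta-base"  # or "bert-base-cased"
set_seed(42)                 # one of the 3 seeds

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # CoNLL 2003 NER: O + B-/I- for PER, LOC, ORG, MISC
)

args = TrainingArguments(
    output_dir="conll03-ner",
    learning_rate=3e-5,  # grid-searched over the same 8e-6 to 7e-5 range
    warmup_ratio=0.1,    # 10% of steps as warmup on the default scheduler
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=..., eval_dataset=...)  # tokenized splits
```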
Is there something in the AllenNLP framework that could be making these trained NER models biased towards BERT and underperform with RoBERTa? Where could we start looking for possible issues? It doesn't look like a hyperparameter issue, given how many combinations we have already tested via grid search.
Thanks!