Text classification using BERT - how to handle misspelled words

Question

I am not sure whether this is the best place to ask this kind of question; perhaps Cross Validated would be a better fit.

I am working on a multiclass text classification problem. I built a BERT-based model in PyTorch using the Hugging Face Transformers library. The model performs pretty well, except when the input sentence contains an OCR error or, equivalently, a misspelled word.

For instance, if the input is "NALIBU DRINK", the BERT tokenizer generates ['na', '##lib', '##u', 'drink'] and the model's prediction is completely wrong. On the other hand, if I correct the first character so that the input is "MALIBU DRINK", the BERT tokenizer generates two tokens, ['malibu', 'drink'], and the model makes a correct prediction with very high confidence.
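To illustrate why a single wrong character changes the tokens so much, here is a minimal sketch of greedy longest-match-first subword splitting, the general scheme WordPiece tokenizers follow. The vocabulary `TOY_VOCAB` and the helper names `wordpiece` and `tokenize` are hypothetical and exist only for this example; the real BERT vocabulary has about 30,000 entries and the real tokenizer does additional normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
TOY_VOCAB = {"malibu", "drink", "na", "##lib", "##u"}


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for w in text.lower().split() for p in wordpiece(w, vocab)]


print(tokenize("MALIBU DRINK"))  # ['malibu', 'drink']
print(tokenize("NALIBU DRINK"))  # ['na', '##lib', '##u', 'drink']
```

Because "nalibu" is not a vocabulary entry, it falls apart into pieces that carry none of the meaning of "malibu", which is consistent with the wrong prediction described above.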

Is there any way to enhance the BERT tokenizer so that it can handle misspelled words?

score 3 · Answer 1 · edited Sep 01 '22 at 08:38


You can use BERT itself to correct the misspelled word. The article linked below explains the process with code snippets: https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/

To summarize: identify misspelled words with a SpellChecker function and get replacement suggestions for each one, then use BERT to pick the most appropriate replacement given the surrounding words.
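The two steps can be sketched roughly as follows. This is not the linked article's code: candidate generation uses the standard library's `difflib.get_close_matches` in place of a dedicated spell checker, and the context scorer is a small stand-in where a BERT masked-language-model score would go. `VOCAB`, `PAIR_COUNTS`, `candidates`, `correct`, and `toy_context_score` are all hypothetical names invented for this example.

```python
import difflib

# Hypothetical domain vocabulary; in practice it would come from your own data.
VOCAB = ["malibu", "drink", "vodka", "rum", "bottle"]

# Hypothetical co-occurrence counts, used only by the stand-in scorer below.
PAIR_COUNTS = {("malibu", "drink"): 5, ("rum", "bottle"): 3}


def candidates(word, vocab=VOCAB, n=3, cutoff=0.6):
    """Return [word] if it is known, otherwise close matches from the vocabulary."""
    if word in vocab:
        return [word]
    return difflib.get_close_matches(word, vocab, n=n, cutoff=cutoff)


def toy_context_score(left, candidate, right):
    """Stand-in for a BERT masked-LM score: counts known pairs with neighbours."""
    neighbours = left[-1:] + right[:1]
    return sum(
        PAIR_COUNTS.get((candidate, w), 0) + PAIR_COUNTS.get((w, candidate), 0)
        for w in neighbours
    )


def correct(text, score_fn=toy_context_score, vocab=VOCAB):
    """Replace unknown words with the candidate that scores best in context."""
    words = text.lower().split()
    for i, word in enumerate(words):
        options = candidates(word, vocab)
        if not options or options == [word]:
            continue  # known word, or nothing close enough: leave it unchanged
        words[i] = max(options, key=lambda c: score_fn(words[:i], c, words[i + 1:]))
    return " ".join(words)


print(correct("NALIBU DRINK"))  # malibu drink
```

The corrected text would then be passed to the classifier's tokenizer as usual. To follow the answer more closely, `score_fn` could be replaced by a function that masks the misspelled position and asks a BERT masked language model how likely each candidate is there.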

edited Sep 01 '22 at 08:38 by Matthew Walker

answered Apr 06 '20 at 23:40 by NRJ_Varshney

the link seems to be dead now – raquelhortab Jul 17 '22 at 09:54
