I am not sure whether this is the best place for this kind of question; perhaps Cross Validated would be a better fit.
I am working on a multiclass text classification problem. I built a BERT-based model in PyTorch using the Hugging Face Transformers library. The model performs well, except when the input sentence contains an OCR error or, equivalently, a misspelling.
For instance, if the input is "NALIBU DRINK", the BERT tokenizer generates ['na', '##lib', '##u', 'drink'] and the model's prediction is completely wrong. On the other hand, if I correct the first character so that the input is "MALIBU DRINK", the BERT tokenizer generates two tokens, ['malibu', 'drink'], and the model makes the correct prediction with very high confidence.
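As far as I understand, this happens because BERT's WordPiece tokenizer splits each word greedily into the longest pieces it can find in its vocabulary, so a single wrong character can turn one known word into several unrelated sub-word pieces. Below is a minimal sketch of that greedy matching. The vocabulary `TOY_VOCAB` and the helper names `wordpiece` and `tokenize` are made up for illustration and are not the real BERT vocabulary or the library's API.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    return [p for w in text.lower().split() for p in wordpiece(w, vocab)]


# Hypothetical toy vocabulary, chosen only to mirror the example above.
TOY_VOCAB = {"malibu", "drink", "na", "##lib", "##u"}

print(tokenize("MALIBU DRINK", TOY_VOCAB))
print(tokenize("NALIBU DRINK", TOY_VOCAB))
```

With this toy vocabulary, the correctly spelled word is matched as a whole, while the misspelled one falls back to shorter pieces that carry none of the original word's meaning.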
Is there any way to make the BERT tokenizer more robust to misspelled words?