
We would like to identify neighborhoods and streets in various cities from a simple search. We don't only use English but also several languages written in Cyrillic script. We need to be able to identify misspellings of locations. When looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html

We tried to play around with it, but cannot find a way to extend the entity recognition database. How can that be done?
If not, is there any other suggestion for a multilingual NLP library that can help with spell checking and also extract various entities matching a custom database?

Dory Zidon
  • From their documentation: `Polyglot requires a model for each task and language. These models are essential for the library to function.` Unfortunately, I don't see any reference information about training additional models. – Josep Valls Jun 26 '17 at 19:26
  • Exactly my issue: how can you train these models yourself? – Dory Zidon Jun 26 '17 at 22:42
  • - We offer training datasets for many languages that you could augment with any new sources of data you have: https://sites.google.com/site/rmyeid/projects/polylgot-ner - We also offer the word embeddings to be used as features: https://sites.google.com/site/rmyeid/projects/polyglot - If you need to train new models, reproduce the work described here: https://arxiv.org/abs/1410.3791 – aboSamoor Jun 27 '17 at 20:19

1 Answer


Have a look at HuggingFace's pretrained models.

  1. They have a multilingual NER model trained on 40 languages, including Cyrillic-script languages such as Russian. It's a fine-tuned version of XLM-RoBERTa, so accuracy should be very good; a minimal usage sketch follows this list. See the details here: https://huggingface.co/jplu/tf-xlm-r-ner-40-lang
  2. They also have a multilingual DistilBERT model trained for typo detection based on the GitHub Typo Corpus. The corpus seems to include typos from 15 different languages, including Russian. See details here: https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection
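
For the NER model, a minimal usage sketch could look like the code below (this is not taken from the model card, so treat it as an untested example; framework="tf" is passed because that checkpoint ships TensorFlow weights):

from transformers import pipeline

# Multilingual token-classification (NER) pipeline for the 40-language model
ner = pipeline("ner", model="jplu/tf-xlm-r-ner-40-lang",
               tokenizer="jplu/tf-xlm-r-ner-40-lang", framework="tf")

ner("я живу в Москве")  # Russian for "I live in Moscow"; the city should come back tagged as LOC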

Here is some example code for the typo checker, taken from its documentation and slightly altered for your use case:

from transformers import pipeline

# Token-classification pipeline that tags each (sub-)word as "ok" or "typo"
typo_checker = pipeline("ner", model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
                        tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection")

result = typo_checker("я живу в Мосве")  # "I live in Moscow", with the city name misspelled
result[1:-1]

 #[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
 #{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
 #{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
 #{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
 #{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
 #{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
 #{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]

result = typo_checker("I live in Moskkow")
result[1:-1]

 #[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
 #{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
 #{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
 #{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
 #{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
 #{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]

It doesn't seem to always work, unfortunately, but maybe it's sufficient for your use case.
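
If you go this route, you will probably also want to merge the WordPiece pieces (the "##" tokens) back into whole words, so you can report which word is misspelled. A rough sketch of such a helper (plain post-processing of the output shown above, not part of the model):

def flag_typos(predictions):
    # Merge "##" sub-tokens back into full words and remember whether
    # any piece of a word was tagged as "typo"
    words, is_typo = [], []
    for p in predictions:
        piece = p["word"]
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
            is_typo[-1] = is_typo[-1] or p["entity"] == "typo"
        else:
            words.append(piece)
            is_typo.append(p["entity"] == "typo")
    return [w for w, bad in zip(words, is_typo) if bad]

result = typo_checker("я живу в Мосве")
flag_typos(result[1:-1])

 #['Мосве'] given the output shown above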

Another option would be spaCy. They don't have as many models for different languages, but with spaCy's EntityRuler it's easy to manually define new entities, i.e. to "extend the entity recognition database".
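
For example, here is a rough sketch using the spaCy v3 API (a blank Russian pipeline and made-up gazetteer entries; note that an exact-string pattern only matches that surface form, so inflected forms need their own patterns or token-level rules):

import spacy

# Blank Russian pipeline plus an EntityRuler filled from your own gazetteer
nlp = spacy.blank("ru")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "LOC", "pattern": "Москва"},             # exact surface form
    {"label": "LOC", "pattern": [{"LOWER": "арбат"}]},  # token-level pattern
])

doc = nlp("Старый Арбат находится в центре Москвы")
print([(ent.text, ent.label_) for ent in doc.ents])

 #[('Арбат', 'LOC')]; "Москвы" is not matched because only the nominative form is listed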

Moritz