
Real-world user-generated text in languages not written in the Latin alphabet is often not in canonical form but in an informal Latinisation: translit, shlyokavitsa, Arabizi, pinyin and so on. Language detection software is starting to handle this, but in most cases it still fails, even though supporting it is technically fairly trivial.


Is there a language detection system that handles these informal Latinisations well? (Ideally a Python lib, but any language or service would be interesting.)

The Yandex and Microsoft offerings and the top Python language ID libs, like langid, have nothing on this front. I know of two that halfway work, both from Google:
- CLD, which is part of Chrome
- the Google Translate API

Besides recognising translit for only a few top languages, they are not ideal for a variety of reasons (accuracy, performance, price...).

This is a significant issue for major languages like Hindi, Persian, Chinese, Arabic and Russian, and for all the other languages not written in the Latin alphabet but commonly Latinised (Romanised) online.

Adam Bittlingmayer
  • You can detect pinyin with a regex: https://stackoverflow.com/questions/20736291/regex-for-matching-pinyin/20736292#20736292 – ccpizza Jul 02 '20 at 21:43
  • @ccpizza I need it for all langs. – Adam Bittlingmayer Jul 03 '20 at 04:56
  • 1
    In that case you'd probably need to roll out your own machine learning detector based on pre-trained models (like google does) or buy it as a service from a third party which has proper datasets. – ccpizza Jul 03 '20 at 08:27
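
For anyone considering the "roll your own" route from the last comment, here is a minimal sketch of what such a detector could look like: a character-trigram naive Bayes classifier in plain Python. Everything in it is an assumption made for illustration: the class name `NgramLangId`, the helper `char_ngrams`, and above all the `TOY_SAMPLES` list, which is a handful of hand-written romanised phrases and not a real dataset. A usable detector would need far more romanised text per language, collected from real users or generated by transliterating native-script corpora.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLangId:
    """Hypothetical toy detector: naive Bayes over character n-grams,
    with add-one smoothing and a uniform prior over languages."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            counts = self.counts[label]
            denom = self.totals[label] + vocab_size
            return sum(math.log((counts[g] + 1) / denom) for g in grams)

        return max(self.counts, key=log_likelihood)


# Illustrative, hand-written romanised phrases (NOT a real training set).
TOY_SAMPLES = [
    ("privet kak dela", "ru"),
    ("spasibo bolshoe", "ru"),
    ("ya ne znayu chto delat", "ru"),
    ("dobroe utro", "ru"),
    ("kak tebya zovut", "ru"),
    ("namaste aap kaise hain", "hi"),
    ("mujhe nahi pata", "hi"),
    ("kya kar rahe ho", "hi"),
    ("bahut accha hai", "hi"),
    ("tum kahan ho", "hi"),
    ("marhaba kifak", "ar"),
    ("ana mish 3aref", "ar"),
    ("shukran ktir", "ar"),
    ("sabah el kheir", "ar"),
    ("inta wein", "ar"),
]

model = NgramLangId(n=3).fit(TOY_SAMPLES)

print(model.predict("kak dela"))      # ru
print(model.predict("aap kaise ho"))  # hi
print(model.predict("shukran"))       # ar
```

Character n-grams are used here because romanised spellings are inconsistent, so word-level features would probably generalise poorly. The sketch also never returns "unknown" and does not distinguish Latinised text from languages natively written in Latin script, both of which a real system would have to handle.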

0 Answers