Language detection for pinyin, translit etc?

Question

Real-world user-generated text in languages with non-Latin scripts is often written not in the canonical script but in an informal Latinisation: translit, shlyokavitsa, Arabizi, pinyin and so on. Language detection software is starting to handle this, but most of it still fails on such text, even though supporting it is technically fairly trivial.

Is there a language detection system that handles these informal Latinisations well? (Ideally a Python lib, but any language or service would be interesting.)

Yandex, Microsoft and the top Python language ID libs, such as langid, have nothing on this front. I know of two that halfway work, both from Google:
- CLD, which is part of Chrome
- the Google Translate API
Besides recognising translit for only a few top languages, they are not ideal for a variety of reasons (accuracy, performance, price...).

This is a significant issue for major languages like Hindi, Persian, Chinese, Arabic and Russian, and for all the other languages not written in the Latin alphabet but commonly Latinised (Romanised) online.

You can detect pinyin with a regex: https://stackoverflow.com/questions/20736291/regex-for-matching-pinyin/20736292#20736292 — ccpizza, Jul 02 '20 at 21:43
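A rough sketch of the regex idea from the comment above, assuming toneless pinyin typed in plain ASCII. This is not the regex from the linked answer: the initial and final lists are a simplified approximation that over-generates (it accepts some strings that are not real pinyin syllables, and short English words such as "man" or "she" also pass), and `pinyin_ratio` is a hypothetical helper name.

```python
import re

# Simplified, over-generating model of a toneless pinyin syllable:
# an optional initial followed by a final. "v" stands in for "ü",
# as commonly typed on ASCII keyboards. Assumption, not a full syllable table.
_INITIAL = r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
_FINAL = (
    r"(?:iang|iong|uang|ang|eng|ing|ong|iao|ian|uai|uan|"
    r"ai|ei|ao|ou|an|en|in|un|ia|ie|iu|ua|uo|ui|ue|er|"
    r"a|o|e|i|u|v)"
)
# A word counts as pinyin-like if it is one or more such syllables.
_PINYIN_WORD = re.compile(rf"(?:{_INITIAL}{_FINAL})+", re.IGNORECASE)


def pinyin_ratio(text):
    """Return the fraction of ASCII-letter tokens that look like pinyin."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if _PINYIN_WORD.fullmatch(token))
    return hits / len(tokens)
```

A threshold on `pinyin_ratio` (say, flagging text where most tokens match) could serve as a cheap pre-filter, but because of the overlap with English it probably needs to be combined with another signal for short inputs.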
In that case you'd probably need to roll out your own machine learning detector based on pre-trained models (like google does) or buy it as a service from a third party which has proper datasets. — ccpizza, Jul 03 '20 at 08:27
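For illustration, here is a minimal sketch of what "rolling your own" detector could look like: a character-trigram naive Bayes classifier with add-one smoothing, standard library only. It is not how CLD or any Google service works. The class name `NgramLangId`, the labels, and the handful of hand-written training phrases are all assumptions made for the example; a usable model would need a proper corpus of Latinised text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NgramLangId:
    """Toy naive Bayes over character trigrams (uniform class priors)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> trigram counts
        self.vocab = set()                  # all trigrams seen in training

    def fit(self, samples):
        for label, text in samples:
            grams = trigrams(text)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = trigrams(text)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothing so unseen trigrams do not zero out a class.
            score = sum(
                math.log((counter[g] + 1) / (total + vocab_size))
                for g in grams
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny hand-written illustrative samples, far too small for real use.
TOY_SAMPLES = [
    ("ru-Latn", "privet kak dela"),
    ("ru-Latn", "ya tebya lyublyu"),
    ("ru-Latn", "spasibo bolshoe"),
    ("ru-Latn", "chto ty delaesh"),
    ("zh-Latn", "ni hao ma"),
    ("zh-Latn", "wo ai ni"),
    ("zh-Latn", "xie xie ni"),
    ("zh-Latn", "ni zai zuo shenme"),
    ("en", "hello how are you"),
    ("en", "i love you"),
    ("en", "thank you very much"),
    ("en", "what are you doing"),
]

model = NgramLangId().fit(TOY_SAMPLES)
```

Calling `model.predict("kak dela")` then returns the label whose trigram statistics best fit the input. The hard part in practice is the training data, not the classifier, which is presumably why the comment also suggests buying this from a third party with proper datasets.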

0 Answers