I found that the unicodedata package can remove diacritics from Latin letters, like é→e or ü→u, as you can see:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'éü').encode('ascii', 'ignore')
b'eu'
But it seems limited, because 1) it doesn't seem able to expand ligatures like æ into ae or œ into oe, and 2) it doesn't seem to translate some other symbols to their closest ASCII equivalent, like ı (dotless i) to i.
>>> unicodedata.normalize('NFKD', u'éüœıx').encode('ascii', 'ignore')
b'eux'
So, is there a package, or a way, to simplify Unicode characters to their most similar ASCII equivalents, addressing points (1) and (2)?
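The best I can come up with myself is pre-processing with a hand-maintained translation table before normalizing (a minimal sketch; the table here is obviously not exhaustive):

>>> LIGATURES = str.maketrans({'æ': 'ae', 'Æ': 'AE', 'œ': 'oe', 'Œ': 'OE', 'ı': 'i'})
>>> unicodedata.normalize('NFKD', 'éüœıx'.translate(LIGATURES)).encode('ascii', 'ignore')
b'euoeix'

But maintaining such a table by hand for every ligature and odd symbol doesn't scale.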
It would also be great if it translated non-Latin symbols to the most similar ones, like И (Cyrillic I) to i, or أ (Arabic alif) to a.
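Ideally, something behaving like this hypothetical simplify() function (the name and exact output are just an illustration of what I'm after):

>>> simplify('Иأ')
'ia'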
Edit after @wjandrea's questions
For the non-Latin case, I know there are many romanisation schemes for each script, depending on the language, and that the same language can be romanised in many ways (like و, the Arabic waw, which can be transcribed as w or o).
BTW, the goal isn't to support the subtleties of linguistic and transliteration systems or traditions, but just to avoid a blank output as much as possible.
Imagine the input is entirely Cyrillic, for example Все люди рождаются свободными. The unicodedata approach above will output an essentially blank string (only the spaces survive), when it would be preferable to get at least something, no matter whether it's a correct transcription or not.
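For example:

>>> unicodedata.normalize('NFKD', 'Все люди рождаются свободными').encode('ascii', 'ignore')
b'   '

Any best-effort output, e.g. something like Vse lyudi rozhdayutsya svobodnymi, would be more useful than that.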