
I found that the unicodedata package can remove diacritics from Latin letters, like é→e or ü→u, as you can see:

>>> unicodedata.normalize('NFKD', u'éü').encode('ascii', 'ignore')
b'eu'

But it seems limited, because 1) it doesn’t seem able to split ligatures like æ into ae or œ into oe, and 2) it doesn’t translate some other symbols to their closest ASCII equivalent, like ı (dotless i) to i.

>>> unicodedata.normalize('NFKD', u'éüœıx').encode('ascii', 'ignore')
b'eux'
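
(Incidentally, a quick check with unicodedata.decomposition(), assuming Python 3, seems to show why: œ and ı simply have no decomposition mapping in the Unicode database, unlike compatibility ligatures such as ﬁ, so NFKD has nothing to expand them into.)

>>> import unicodedata
>>> unicodedata.decomposition('œ'), unicodedata.decomposition('ı')
('', '')
>>> unicodedata.decomposition('ﬁ')  # U+FB01: this one NFKD does expand to "fi"
'<compat> 0066 0069'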

So, is there a package, or a way, to simplify Unicode characters to their closest ASCII equivalents while handling points (1) and (2)? It would also be great if it translated non-Latin symbols to the most similar ones, like И (Cyrillic i) to i or أ (Arabic alif) to a.


Edits after @wjandrea’s questions

For the non-Latin case, I know there are many romanisation schemes for each script, depending on the language, and that a single language can be romanised in several ways (like و, the Arabic waw, which can be transcribed as w or o).

BTW, the goal isn’t to respect the subtleties of linguistic and transliteration systems or traditions, but just to avoid a blank output as much as possible.

Imagine the input is entirely Cyrillic, for example Все люди рождаются свободными. The output of unicodedata will be essentially blank, when it would be preferable to get at least something, no matter whether it’s a correct transcription or not.
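
Concretely, assuming Python 3, only the ASCII spaces survive:

>>> unicodedata.normalize('NFKD', 'Все люди рождаются свободными').encode('ascii', 'ignore')
b'   '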

  • Are you using Python 2 or 3? I ask because the `u` string prefix has no effect in Python 3, and Python 2 is EOL, but also Unicode handling was much more difficult in it. – wjandrea Sep 16 '22 at 19:49
  • Why do you want to do this? Any text that you put through this process is going to get garbled. For example, French `montré` "shown" -> `montre` "wristwatch", Turkish `ılık` "warm" -> `ilik` "marrow". Regarding non-Latin symbols, that might be more difficult than you think because how to convert them depends on the language, not just the script; for example, for Ukrainian, Cyrillic `И` transliterates to `y`, not `i`. – wjandrea Sep 16 '22 at 20:16
  • Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) Note: the question asks about accents only, but the answers mostly cover reduction of Unicode to plain ASCII. – lenz Sep 16 '22 at 20:36
  • @wjandrea I use python3 – fauve Sep 16 '22 at 20:36
  • @wjandrea Yes, I know that Latin transliteration depends on the language, and that sometimes a single language has several possible transliteration standards. But I’m not looking for a perfect solution, just something that gives at least some output instead of a blank one when the given string is full of Cyrillic symbols, for example. – fauve Sep 16 '22 at 20:38
  • Why not just output the Unicode text? Are you dealing with a system that only supports ASCII? – wjandrea Sep 16 '22 at 23:00
  • Use [Unidecode](https://pypi.org/project/Unidecode/). – Mark Ransom Sep 17 '22 at 02:46
  • @wjandrea there are lots of cases that require this, for example filling names, addresses... in a foreign form, or unaccented string match (most modern browsers do that by default) – phuclv Sep 17 '22 at 04:22
  • @wjandrea The use case is the following: a user gives his full name, correctly written in any language, but an ASCII ID is associated with his account, for URLs or any other context requiring ASCII. – fauve Sep 17 '22 at 05:50
  • Is it necessary for the ID to be readable, or could you use a base64 encoding of utf-8 characters? – Mark Ransom Sep 17 '22 at 17:14
  • @fauve Why not just use separate ID and Name fields? That's what most websites that require an ASCII ID do. Or just use the email as the ID – phuclv Sep 18 '22 at 04:11

0 Answers