Translate Unicode to ascii (if possible)

Question

There are some Unicode characters which could be simplified to ASCII without losing much.

Example:

>>> unicodedata.name(u'-')
'HYPHEN-MINUS'

>>> unicodedata.name(u'−')
'MINUS SIGN'

In the above case I prefer "HYPHEN-MINUS", since "MINUS SIGN" is not ASCII.

I could write my own translator easily, but I don't like re-inventing the wheel.

Is there no simpler way to translate special unicode characters to simple ascii characters?

I know this is guessing and only works for some unicode characters, but that's ok in this context.

Simplest way: use a mapping like you did, but don't re-invent the wheel. Use [Unidecode](https://pypi.python.org/pypi/Unidecode) instead (and yes, it maps MINUS SIGN to HYPHEN-MINUS). — Martijn Pieters, Apr 18 '17 at 12:15

score 2 · Answer 1 · answered Apr 12 '17 at 12:53

This may not be the perfect answer. The Unicode Consortium has the draft TR36, which deals with character similarities in Unicode (not just ASCII).

You can search for Python modules whose developers make a best effort to map them. A proof-of-concept homoglyph attack, using Unicode characters and symbols that look similar to ASCII ones, can be found here. (Due to font issues, some characters or symbols might be shown as square boxes by your browser.)

You can make use of the Python confusable_homoglyphs package. The documentation is shown here.

from confusable_homoglyphs import confusables
confusables.is_confusable("-")

Result:

[{'homoglyphs': [{'c': '‐', 'n': 'HYPHEN'}, {'c': '‑', 'n': 'NON-BREAKING HYPHEN'}, {'c': '‒', 'n': 'FIGURE DASH'}, {'c': '–', 'n': 'EN DASH'}, {'c': '﹘', 'n': 'SMALL EM DASH'}, {'c': '\u200e۔\u200e', 'n': 'ARABIC FULL STOP'}, {'c': '⁃', 'n': 'HYPHEN BULLET'}, {'c': '˗', 'n': 'MODIFIER LETTER MINUS SIGN'}, {'c': '−', 'n': 'MINUS SIGN'}, {'c': '➖', 'n': 'HEAVY MINUS SIGN'}, {'c': 'Ⲻ', 'n': 'COPTIC CAPITAL LETTER DIALECT-P NI'}], 'alias': 'COMMON', 'character': '-'}]

Now you need to decide which remapping you prefer. Check out the source code if you want to borrow some concepts from the libraries.
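As a rough illustration of that last step, here is a minimal sketch that turns a hand-picked subset of the homoglyphs listed above into a translation table. It uses only the standard library; the names `DASH_LIKE`, `TO_ASCII` and `simplify` are made up for this example, and which look-alikes to include is an assumption you would adjust for your own data.

```python
import unicodedata

# Hypothetical hand-picked remap: the look-alikes from the result above that
# we are willing to treat as a plain HYPHEN-MINUS. ARABIC FULL STOP and the
# Coptic letter are deliberately left out; that choice is a judgment call.
DASH_LIKE = "\u2010\u2011\u2012\u2013\ufe58\u2043\u02d7\u2212\u2796"

# str.maketrans builds the code point -> replacement table for str.translate.
TO_ASCII = str.maketrans({ch: "-" for ch in DASH_LIKE})


def simplify(text):
    """Replace the chosen dash look-alikes with '-' and leave the rest alone."""
    return text.translate(TO_ASCII)


print(simplify("3 \u2212 2 \u2013 1"))       # 3 - 2 - 1
print(unicodedata.name(simplify("\u2212")))  # HYPHEN-MINUS
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be ASCII for the characters you chose to map.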

Thank you very much for this answer. I was not aware of the term "Homoglyph" before. This and the link to the python package helped a lot. BTW, I added this to my list "join softwarerecs and stackoverflow": https://github.com/guettli/join-stackoverflow-and-softwarerecs/blob/master/README.md — guettli, Apr 13 '17 at 07:25
Why is this no "perfect" answer according to your point of view? — guettli, Apr 13 '17 at 07:26
@guettli Because you asked for a ready-made Unicode-to-ASCII mapper. The confusable_homoglyphs author makes a best effort to map most characters, which needs continuous feedback and contributions to get better. — mootmoot, Apr 13 '17 at 07:35
For me the answer is very good... I don't like the word "perfect". It sounds like "mature", sounds like "no progress" :-) — guettli, Apr 13 '17 at 07:39

score 1 · Answer 2 · edited May 23 '17 at 11:46

There is useful information regarding inconsistencies in Unicode character naming here: Python library to translate multi-byte characters into 7-bit ASCII in Python and here: Translating multi-byte characters into 7-bit ASCII in Python

But to answer your question, it looks like there is no standard library for translating multi-byte Unicode into ASCII. See the second link if you do not yet have your own solution.
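To show how far the standard library gets on its own, here is a small sketch (not taken from the linked questions) that decomposes text with `unicodedata.normalize` and then drops whatever is still non-ASCII. The function name `strip_to_ascii` is made up for this example.

```python
import unicodedata


def strip_to_ascii(text):
    """NFKD-decompose, then drop every code point that is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_to_ascii("caf\u00e9"))         # cafe
print(repr(strip_to_ascii("3 \u2212 2")))  # '3  2'
```

Accented letters come out fine because they decompose into a base letter plus a combining mark, but MINUS SIGN has no decomposition and is silently dropped instead of becoming HYPHEN-MINUS. That gap is why an explicit mapping is still needed for cases like the one in the question.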
