I have some text that uses Unicode punctuation, like left double quote, right single quote for apostrophe, and so on, and I need it in ASCII. Does Python have a database of these characters with obvious ASCII substitutes so I can do better than turning them all into "?" ?
3 Answers
Unidecode looks like a complete solution. It converts fancy quotes to ASCII quotes, accented Latin characters to unaccented ones, and even attempts transliteration to deal with characters that have no ASCII equivalents. That way your users don't have to see a bunch of ? when you have to pass their text through a legacy 7-bit ASCII system.
>>> from unidecode import unidecode
>>> print(unidecode("\u5317\u4EB0"))
Bei Jing
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
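To see the quote handling specifically, here's a quick sketch (assuming the `unidecode` package is installed, e.g. via pip; the sample string is made up):

```python
from unidecode import unidecode

# Curly quotes become straight ASCII quotes, and accents are stripped
print(unidecode('\u201cd\u00e9j\u00e0 vu\u201d'))  # "deja vu"
```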

- Hm.. German umlauts are converted to their base character instead of e.g. ö=oe, ä=ae, etc. – ThiefMaster May 08 '14 at 14:00
- @ThiefMaster are those equivalents true across all languages? Maybe Unidecode is going for the lowest common denominator. – Mark Ransom Jan 20 '15 at 20:36
- Unidecode most certainly goes for the language-independent solution. For a German-centric solution, convert applicable characters manually (`s/ö/oe/`, etc.) before cleaning up the rest with `unidecode`. – alexis Sep 12 '15 at 18:49
- Indeed, in Finnish for example, while `ä -> a`, `ö -> o` is outright wrong, it is still preferable to `ae` and `oe` – Antti Haapala -- Слава Україні Dec 26 '15 at 17:44
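Following alexis's comment, a minimal sketch of a German-specific pass (the mapping below is an assumption based on the usual ä→ae conventions, and standard-library `unicodedata.normalize` stands in for `unidecode` as the catch-all step to keep the sketch dependency-free):

```python
import unicodedata

# Assumed German transliteration table, applied before the generic fallback
GERMAN = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
                        'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue', 'ß': 'ss'})

def german_to_ascii(text):
    text = text.translate(GERMAN)
    # Strip any remaining diacritics generically
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(german_to_ascii('Schöne Grüße'))  # Schoene Gruesse
```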
In my original answer, I also suggested `unicodedata.normalize`. However, I decided to test it out, and it turns out it doesn't work with Unicode quotation marks. It does a good job translating accented Unicode characters, so I'm guessing `unicodedata.normalize` is implemented using the `unicodedata.decomposition` function, which leads me to believe it can probably only handle Unicode characters that are combinations of a letter and a diacritical mark, but I'm not really an expert on the Unicode specification, so I could just be full of hot air...
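That guess is easy to check with `unicodedata` itself — accented letters decompose, but curly quotes don't:

```python
import unicodedata

# 'é' decomposes into 'e' + a combining accent, so the accent can be dropped:
print(unicodedata.normalize('NFKD', 'caf\u00e9').encode('ascii', 'ignore'))  # b'cafe'
# Curly quotes have no decomposition, so 'ignore' silently drops them:
print(unicodedata.normalize('NFKD', '\u201chi\u201d').encode('ascii', 'ignore'))  # b'hi'
# unicodedata.decomposition confirms it: quotes have no decomposition entry
print(unicodedata.decomposition('\u201c'))  # '' (empty string)
```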
In any event, you can use `str.translate` to deal with punctuation characters instead. The `translate` method takes a dictionary mapping Unicode ordinals to Unicode ordinals, so you can create a mapping that translates Unicode-only punctuation to ASCII-compatible punctuation:
>>> # Map left and right single/double quotation marks to ASCII quotes
>>> punctuation = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22}
>>> teststring = '\u201Chello, world!\u201D'
>>> teststring.translate(punctuation).encode('ascii', 'ignore')
b'"hello, world!"'
You can add more punctuation mappings if needed, but I don't think you need to worry about handling every single Unicode punctuation character. If you do need to handle accents and other diacritical marks, you can still use `unicodedata.normalize` to deal with those characters.
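If you'd rather not write out the ordinals by hand, `str.maketrans` builds the same kind of table from two parallel strings:

```python
# Equivalent table built with str.maketrans (same four quote characters)
table = str.maketrans('\u2018\u2019\u201c\u201d', '\'\'""')
print('\u201chello, world!\u201d'.translate(table))  # "hello, world!"
```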

Interesting question.
Google helped me find this page which describes using the unicodedata module as follows:
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
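A quick, self-contained check of that snippet (`title` is just the page's example variable; the value here is made up):

```python
import unicodedata

title = 'Jalape\u00f1o a\u00f1ejo'  # hypothetical input
ascii_bytes = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
print(ascii_bytes.decode('ascii'))  # Jalapeno anejo
```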
