
I'm seeking a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and whatnot should be dropped. I'm imagining there must be a pretty canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"
dreeves
  • How do you define "closest"? – nmichaels Sep 30 '10 at 18:47
  • Good question! I guess I'm hoping not to have to define it, that there's some standard, accepted mapping somewhere. I'm sure this is hairier than I imagine to do really right, but partial solutions would be valuable as well. – dreeves Sep 30 '10 at 18:55
  • `iconv` can do it with a `//TRANSLIT` flag, not sure whether there are any proper Python bindings for it though. – Wrikken Sep 30 '10 at 18:57
  • Possible duplicates: http://stackoverflow.com/questions/3586903/sqlite-remove-non-utf-8-characters and http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters – unutbu Sep 30 '10 at 19:03
  • http://pypi.python.org/pypi/Unidecode/ related: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string http://stackoverflow.com/questions/1192367/whats-a-good-way-to-replace-international-characters-with-their-base-latin-count http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database – jfs Sep 30 '10 at 19:04

4 Answers


Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

It does exactly what you ask for.
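
For example (a minimal sketch, assuming the `unidecode` package from that link is installed):

>>> from unidecode import unidecode
>>> unidecode("Étienne")
'Etienne'

It also covers characters the NFKD trick misses, such as *ø* and *ß*.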

Llanilek

In Python 3 and using the regex implementation at PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD and then remove the diacritics:

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
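
For comparison, the same result is possible with the standard library alone (a sketch using `unicodedata.combining` to drop combining marks after the NFKD normalisation):

>>> "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))
'Etienne'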
MRAB
  • That really doesn’t do much. For example, code point U+00F8, *ø*, does **not** decompose to something with Marks. But it still has the same primary collation strength as *o* has: 138E per DUCET 6.0. Similarly, there is no decomposition for code point U+00F0, *ð*. However, its primary collation strength is the same as a *d* at 1250. People need to learn to work *with* Unicode, not against it! – tchrist Apr 02 '11 at 03:12
  • I’ve looked at the library you mention, and it looks very exciting. Are you its author? I’ve been interested in a Python library with better Unicode support for quite a while now. Let me look it over and send you mail. Thanks very much. – tchrist Apr 02 '11 at 05:23
  • Can you explain the meaning of `r"\p{Mn}"`? I just read through the regex docs, and I don't understand what Mn signifies. – Coquelicot Apr 09 '13 at 14:25
`\p{Mn}` will match a codepoint which has the `Mn` (or `Nonspacing_Mark`) Unicode property. Other properties include `Lu` (`Uppercase_Letter`) and `Cyrillic`. – MRAB Apr 09 '13 at 17:29

Searching for 'iconv TRANSLIT python' I found http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/, which looks like it might be what you need. The comments there have some other ideas that use the standard library instead.

There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
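
The NFKD trick from that second link amounts to something like this (a sketch in Python 3; characters with no ASCII decomposition are silently dropped rather than transliterated):

>>> import unicodedata
>>> unicodedata.normalize("NFKD", "Étienne").encode("ascii", "ignore").decode("ascii")
'Etienne'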

Douglas Leeder

Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures and lots of other Latin-based characters that can't be (or aren't) decomposed. For this, a prepared translation table is necessary (and much faster).
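
A minimal sketch of the translation-table idea (the mappings shown are a tiny hand-picked sample, not a complete table; a library like Unidecode ships a full one):

>>> table = {ord("æ"): "ae", ord("ø"): "o", ord("ð"): "d", ord("ß"): "ss"}
>>> "æblegrød".translate(table)
'aeblegrod'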

John Machin
  • Thanks John. I really hate to see people mutilating Unicode data. Usually it's because they don't know how to do a comparison at collation strength 1 (primary) only. For example, at level 1 there are 99 A's, 43 B's, 53 C's, etc. O has the most at 111, Q the fewest at 34. NFKD ups those numbers a bit, pushing A's to 115 and O's to 119, for example. – tchrist Apr 02 '11 at 03:07