
I'm seeking a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and whatnot should be dropped. I'm imagining there must be a pretty canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"
dreeves
  • How do you define "closest"? – nmichaels Sep 30 '10 at 18:47
  • Good question! I guess I'm hoping not to have to define it, that there's some standard, accepted mapping somewhere. I'm sure this is hairier than I imagine to do really right, but partial solutions would be valuable as well. – dreeves Sep 30 '10 at 18:55
  • `iconv` can do it with a `//TRANSLIT` flag, not sure whether there are any proper Python bindings for it though. – Wrikken Sep 30 '10 at 18:57
  • Possible duplicates: http://stackoverflow.com/questions/3586903/sqlite-remove-non-utf-8-characters and http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters – unutbu Sep 30 '10 at 19:03
  • http://pypi.python.org/pypi/Unidecode/ related: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string http://stackoverflow.com/questions/1192367/whats-a-good-way-to-replace-international-characters-with-their-base-latin-count http://stackoverflow.com/questions/2854230/whats-the-fastest-way-to-strip-and-replace-a-document-of-high-unicode-characters http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database – jfs Sep 30 '10 at 19:04

4 Answers


Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

It does exactly what you ask for.
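
For example (a minimal sketch, assuming the `unidecode` package from that link is installed):

>>> from unidecode import unidecode
>>> unidecode("Étienne")
'Etienne'

It also covers characters the NFKD trick misses, such as *ø* and *ß*.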

Llanilek

In Python 3 and using the regex implementation at PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD and then remove the diacritics:

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
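
For comparison, the same result is possible with the standard library alone (a sketch using `unicodedata.combining` to drop combining marks after the NFKD normalisation):

>>> "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))
'Etienne'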
MRAB
  • That really doesn’t do much. For example, code point U+00F8, *ø*, does **not** decompose to something with Marks. But it still has the same primary collation strength as *o* has: 138E per DUCET 6.0. Similarly, there is no decomposition for code point U+00F0, *ð*. However, its primary collation strength is the same as a *d* at 1250. People need to learn to work *with* Unicode, not against it! – tchrist Apr 02 '11 at 03:12
  • I’ve looked at the library you mention, and it looks very exciting. Are you its author? I’ve been interested in a Python library with better Unicode support for quite a while now. Let me look it over and send you mail. Thanks very much. – tchrist Apr 02 '11 at 05:23
  • Can you explain the meaning of `r"\p{Mn}"`? I just read through the regex docs, and I don't understand what Mn signifies. – Coquelicot Apr 09 '13 at 14:25
`\p{Mn}` will match a codepoint which has the `Mn` (or `Nonspacing_Mark`) Unicode property. Other properties include `Lu` (`Uppercase_Letter`) and `Cyrillic`. – MRAB Apr 09 '13 at 17:29

Searching for 'iconv TRANSLIT python' I found http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/, which looks like it might be what you need. The comments there have some other ideas that use the standard library instead.

There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
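
The NFKD trick from that second link amounts to something like this (a sketch in Python 3; characters with no ASCII decomposition are silently dropped rather than transliterated):

>>> import unicodedata
>>> unicodedata.normalize("NFKD", "Étienne").encode("ascii", "ignore").decode("ascii")
'Etienne'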

Douglas Leeder

Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures and lots of other Latin-based characters that can't be (or aren't) decomposed. For this, a prepared translation table is necessary (and much faster).
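
A minimal sketch of the translation-table idea (the mappings shown are a tiny hand-picked sample, not a complete table; a library like Unidecode ships a full one):

>>> table = {ord("æ"): "ae", ord("ø"): "o", ord("ð"): "d", ord("ß"): "ss"}
>>> "æblegrød".translate(table)
'aeblegrod'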

John Machin
  • Thanks John. I really hate to see people mutilating Unicode data. Usually it's because they don't know how to do a comparison at collation strength 1 (primary) only. For example, at level 1 there are 99 A's, 43 B's, 53 C's, etc. O has the most at 111, Q the fewest at 34. NFKD ups those numbers a bit, pushing A's to 115 and O's to 119, for example. – tchrist Apr 02 '11 at 03:07