12

How can I replace non-ASCII characters in a Unicode string in Python?

These are the outputs I expect for the given inputs:

música -> musica

cartón -> carton

caño -> cano

Maybe with a dict where 'á' is a key and 'a' is its value?

Juanjo Conti
  • 2
    possible duplicate of [What is the best way to remove accents in a python unicode string?](http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) – nosklo Sep 13 '10 at 22:04

2 Answers

21

If all you want to do is degrade accented characters to their non-accented equivalent:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u"m\u00fasica").encode('ascii', 'ignore')
'musica'
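
The session above is Python 2, where `encode` returns a plain `str`. As a minimal sketch of the same idea for Python 3 (the helper name `strip_accents` is made up for illustration), the bytes need to be decoded back into text at the end:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits e.g. "ú" into "u" plus a combining acute accent (U+0301)
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks;
    # decoding turns the resulting bytes back into a str
    return decomposed.encode("ascii", "ignore").decode("ascii")

for word in ("música", "cartón", "caño"):
    print(word, "->", strip_accents(word))
```

Note that characters with no decomposition, such as "ø" or "ß", are silently dropped rather than replaced, so this is only suitable when losing them is acceptable.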
llasram
  • What does `NFKD` do? – Bikash Gyawali Oct 15 '19 at 11:20
  • 1
    @bikashg "Normalization Form Compatibility Decomposition." It decomposes the string by "compatibility": it splits any precombined characters into an equivalent sequence of base and combining characters, and it also transforms e.g. ligatures into the semantically equivalent sequence of component characters. See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization for full details and references to the relevant standards. – llasram Oct 16 '19 at 14:52
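
To make the difference concrete, here is a small Python 3 sketch (variable names are illustrative only) comparing canonical decomposition (`NFD`) with compatibility decomposition (`NFKD`) on an accented letter and on the "ﬁ" ligature:

```python
import unicodedata

# "ú" (U+00FA) is a single precomposed code point;
# NFKD splits it into "u" followed by the combining acute accent U+0301
accented = unicodedata.normalize("NFKD", "\u00fa")
print([hex(ord(c)) for c in accented])

# "ﬁ" (U+FB01) is a ligature with only a compatibility mapping:
# NFD leaves it untouched, NFKD expands it into the letters "f" and "i"
ligature_nfd = unicodedata.normalize("NFD", "\ufb01")
ligature_nfkd = unicodedata.normalize("NFKD", "\ufb01")
print(ligature_nfd == "\ufb01", ligature_nfkd)
```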
7

Now, just to supplement that answer: your data may not arrive as unicode (e.g. you are reading a file in another encoding, so you cannot simply prefix the string with a "u"). In that case you have to decode the bytes first. Here's a snippet that may work too, assuming the input is ISO-8859-1 (mostly useful for those reading files in English or other Western European languages).

import unicodedata
# Python 2: someString is a byte string; decode it to unicode before normalizing
unicodedata.normalize('NFKD', unicode(someString, "ISO-8859-1")).encode("ascii", "ignore")
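
In Python 3 the `unicode` builtin no longer exists, so the equivalent step is `bytes.decode`. A rough sketch, assuming the raw bytes really are ISO-8859-1 unless told otherwise (the function name `bytes_to_ascii` is hypothetical):

```python
import unicodedata

def bytes_to_ascii(raw, encoding="ISO-8859-1"):
    # Decode the raw bytes using the encoding the data was written in
    text = raw.decode(encoding)
    # Split accented characters into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII and return a str
    return decomposed.encode("ascii", "ignore").decode("ascii")

raw = "cartón".encode("ISO-8859-1")  # simulate bytes read from a Latin-1 file
print(bytes_to_ascii(raw))
```

If the declared encoding is wrong (for example UTF-8 bytes decoded as ISO-8859-1), the decode step will not fail but the result will be garbage, so it is worth confirming the source encoding first.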
fiacobelli