Replace non-ascii chars from a unicode string in Python

Question

How can I replace non-ascii chars from a unicode string in Python?

This is the output I expect for the given inputs:

música -> musica

cartón -> carton

caño -> cano

Maybe with a dict where 'á' is a key and 'a' is the value?

possible duplicate of [What is the best way to remove accents in a python unicode string?](http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — nosklo, Sep 13 '10 at 22:04

score 21 · Accepted Answer · answered Sep 13 '10 at 22:07

If all you want to do is degrade accented characters to their non-accented equivalent:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u"m\u00fasica").encode('ascii', 'ignore')
'musica'
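
The session above is Python 2 (note the `u"..."` literal and the byte-string result). Under Python 3, where `encode` returns `bytes`, the same idea might be sketched as follows. The helper name `strip_accents` is hypothetical, chosen only for illustration.

```python
import unicodedata

def strip_accents(text):
    # Decompose each character, e.g. 'ú' becomes 'u' + a combining acute accent
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop everything that is not ASCII, then turn the bytes back into a str
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('música'))  # musica
print(strip_accents('cartón'))  # carton
print(strip_accents('caño'))    # cano
```

Characters that have no decomposition, such as 'ß' or 'ø', are dropped entirely by this approach rather than replaced, so it suits accent stripping better than general transliteration.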

llasram


What does `NFKD` do? – Bikash Gyawali Oct 15 '19 at 11:20

@bikashg "Normalization Form Compatibility Decomposition." It decomposes the string by "compatibility": it both decomposes any precomposed characters into an equivalent sequence of base and combining characters and transforms e.g. ligatures into the semantically equivalent sequence of component characters. See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization for full details and references to relevant standards. – llasram Oct 16 '19 at 14:52
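
To make the difference concrete, here is a small sketch (Python 3 assumed; the variable names are only illustrative) comparing the canonical form NFD with the compatibility form NFKD on an accented letter and on a ligature.

```python
import unicodedata

# 'ú' (U+00FA) decomposes into 'u' followed by a combining acute accent
decomposed = unicodedata.normalize('NFKD', '\u00fa')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER U', 'COMBINING ACUTE ACCENT']

# The ligature U+FB01 has only a compatibility decomposition, so NFD
# leaves it untouched while NFKD expands it to the two letters 'f' and 'i'
nfd_fi = unicodedata.normalize('NFD', '\ufb01')
nfkd_fi = unicodedata.normalize('NFKD', '\ufb01')
print(len(nfd_fi), len(nfkd_fi))  # 1 2
```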

score 7 · Answer 2 · answered Feb 09 '13 at 06:35

Just to supplement that answer: your data may not arrive as unicode (e.g. you are reading a file in another encoding, so you cannot simply prefix the string with a "u"). In that case, decode the byte string first. Here's a snippet that may work too (mostly for files in ISO-8859-1, an encoding common for Western European text).

import unicodedata
# Python 2: someString is a byte string (str) holding ISO-8859-1 encoded text
unicodedata.normalize('NFKD', unicode(someString, "ISO-8859-1")).encode("ascii", "ignore")
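
In Python 3 there is no `unicode()` builtin. A rough equivalent, assuming the input is a `bytes` object in a known encoding, decodes first and then normalizes. The function name `bytes_to_ascii` and its default encoding are assumptions made for this sketch.

```python
import unicodedata

def bytes_to_ascii(raw, encoding='iso-8859-1'):
    # Decode the raw bytes into a str using the (assumed) source encoding
    text = raw.decode(encoding)
    # Decompose accented characters, then discard the non-ASCII remainder
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

raw = b'm\xfasica'  # 'música' encoded as ISO-8859-1
print(bytes_to_ascii(raw))  # musica
```

If the encoding guess is wrong, the result may be silently incorrect rather than raising an error, since every byte value is valid in ISO-8859-1.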
