Removing diacritical marks using Python

Question

I have a couple of text files containing characters that have diacritical marks, for example è, á, ô and so on. I'd like to replace these characters with e, a, o, etc.

How can I achieve this in Python? Grateful for help!

Put your replacements in a dictionary, open the text files and use an answer from https://stackoverflow.com/questions/2400504/easiest-way-to-replace-a-string-using-a-dictionary-of-replacements — Jonathan Scholbach, Jan 25 '18 at 14:44
This question should be closed, because the problem already has answers on Stack Overflow. If it is about how to open a text file, or how to replace text in a string, there are sufficient answers around. That's why I give -1 — Jonathan Scholbach, Jan 25 '18 at 14:46
You should use the complete Unicode homographs table just to make sure you don't miss any, taken from this answer: https://stackoverflow.com/questions/9491890/is-there-a-list-of-characters-that-look-similar-to-english-letters — AntiMatterDynamite, Jan 25 '18 at 14:47
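
For reference, the dictionary approach suggested in the first comment could look roughly like the sketch below. The `replacements` mapping and the sample `text` are illustrative assumptions, not taken from the question, and the mapping only covers the three characters the question mentions; a real mapping would need one entry per accented character.

```python
# Hypothetical mapping: one entry per precomposed accented character.
replacements = {"è": "e", "á": "a", "ô": "o"}

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(replacements)

text = "è, á, ô"
result = text.translate(table)
print(result)
```

Note that this only matches precomposed characters. Text that stores an accent as a separate combining character would pass through unchanged unless it is normalized first.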

score 10 · Accepted Answer · answered Jan 25 '18 at 14:46

Try unidecode (you may need to install it).

>>> from unidecode import unidecode
>>> s = u"é"
>>> unidecode(s)
'e'

answered Jan 25 '18 at 14:46 by Demosthenes

score 2 · Answer 2 · answered Jan 25 '18 at 14:53

Example of what you could do:

 import unidecode

 accented_string = u'Málaga'
 # accented_string is of type 'unicode' in Python 2 ('str' in Python 3)
 unaccented_string = unidecode.unidecode(accented_string)
 # unaccented_string contains 'Malaga' and is of type 'str'

Your problem is very similar to an existing question. Check this: "What is the best way to remove accents in a Python unicode string?"
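
If installing a third-party package is not an option, a standard-library alternative is to decompose each character with `unicodedata.normalize` and drop the combining marks. This is a minimal sketch, not what this answer uses; the function name `strip_diacritics` is made up for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed character such as 'á' into 'a' plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Málaga"))
```

Unlike unidecode, this leaves letters that have no decomposition (for example ø or ł) untouched, so it is narrower in scope.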

score 1 · Answer 3 · answered Oct 15 '19 at 12:24

In Python 3, you simply need to use the unidecode package. It works with both lowercase and uppercase letters.

Install the package (you may need to use pip3 instead of pip, depending on your system and setup):

$ pip install unidecode

Then use it as follows:

from unidecode import unidecode

text = ["ÉPÍU", "Naïve Café", "EL NIÑO"]

text1 = [unidecode(s) for s in text]
print(text1)
# ['EPIU', 'Naive Cafe', 'EL NINO']

text2 = [unidecode(s.lower()) for s in text]
print(text2)
# ['epiu', 'naive cafe', 'el nino']
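
Since the question is about text files, here is a rough sketch of applying such a conversion to a whole file. The names `convert_file` and `strip_diacritics`, the file paths, and the UTF-8 encoding are all assumptions for illustration. The default transform uses only the standard library so the sketch runs without extra packages; if unidecode is installed, its `unidecode` function can be passed as `transform` instead.

```python
import unicodedata
from pathlib import Path


def strip_diacritics(text):
    """Standard-library fallback: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def convert_file(src, dst, transform=strip_diacritics, encoding="utf-8"):
    """Read src, apply transform to its text, and write the result to dst."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(transform(text), encoding=encoding)


# Example call (hypothetical paths):
# convert_file("input.txt", "output.txt")
# With unidecode installed:
# from unidecode import unidecode
# convert_file("input.txt", "output.txt", transform=unidecode)
```

Writing to a separate output file keeps the original intact in case the assumed encoding turns out to be wrong.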

Isn't this just a repeat of [the answer from January, 2018](https://stackoverflow.com/a/48445588/354577)? — ChrisGPT was on strike, Feb 22 '22 at 16:14
