3

I have a couple of text files containing characters with diacritical marks, for example è, á, ô, and so on. I'd like to replace these characters with e, a, o, etc.

How can I achieve this in Python? Grateful for help!

sunyata
  • 1
    `text.replace('é','e')` –  Jan 25 '18 at 14:44
  • Put your replacements in a dictionary, open the text files and use an answer from https://stackoverflow.com/questions/2400504/easiest-way-to-replace-a-string-using-a-dictionary-of-replacements – Jonathan Scholbach Jan 25 '18 at 14:44
  • 2
    This question should be closed, because the problem already has answers on Stack Overflow. Whether it is about how to open a text file or how to replace characters in a string, there are sufficient answers around. That's why I gave it -1 – Jonathan Scholbach Jan 25 '18 at 14:46
  • You should use the complete Unicode homographs table just to make sure you don't miss any, taken from this answer: https://stackoverflow.com/questions/9491890/is-there-a-list-of-characters-that-look-similar-to-english-letters – AntiMatterDynamite Jan 25 '18 at 14:47

3 Answers

10

Try unidecode (you may need to install it).

>>> from unidecode import unidecode
>>> s = u"é"
>>> unidecode(s)
'e'
Demosthenes
2

Example of what you could do:

 import unidecode

 accented_string = u'Málaga'
 # accented_string is a Unicode string ('unicode' in Python 2, 'str' in Python 3)
 unaccented_string = unidecode.unidecode(accented_string)
 # unaccented_string contains 'Malaga' and is of type 'str'

This is very similar to your problem. See: What is the best way to remove accents in a Python unicode string?
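
If installing a third-party package is not an option, something along these lines might work with only the standard library. This is a sketch of a different technique from unidecode, not a drop-in replacement: it decomposes each character with `unicodedata.normalize` and drops the combining marks. The function name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed (hypothetical helper)."""
    # NFD splits a character such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category 'Mn') and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("Málaga, Naïve Café, EL NIÑO"))
# Malaga, Naive Cafe, EL NINO
```

Note that letters with no decomposition, such as ø or ß, pass through unchanged here, whereas unidecode transliterates them to ASCII.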

1

In Python 3, you simply need to use the unidecode package. It works with both lowercase and uppercase letters.

Installing the package: (you may need to use pip3 instead of pip depending on your system and setup)

$ pip install unidecode

Then using it as follows:

from unidecode import unidecode

text = ["ÉPÍU", "Naïve Café", "EL NIÑO"]

text1 = [unidecode(s) for s in text]
print(text1)
# ['EPIU', 'Naive Cafe', 'EL NINO']

text2 = [unidecode(s.lower()) for s in text]
print(text2)
# ['epiu', 'naive cafe', 'el nino']
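
Since the question is about text files rather than single strings, here is a rough sketch of how the same call could be applied to a whole file. The helper name `convert_file`, the file names, and the UTF-8 default are assumptions for illustration; the `transform` argument is any function that maps a string to a string, such as `unidecode`.

```python
from pathlib import Path
from typing import Callable


def convert_file(src: str, dst: str, transform: Callable[[str], str],
                 encoding: str = "utf-8") -> None:
    """Read src, apply transform to its text, and write the result to dst.

    Hypothetical helper; assumes the file fits in memory and that
    `encoding` matches how the file was actually saved.
    """
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(transform(text), encoding=encoding)
```

With unidecode installed, usage would look like `convert_file("input.txt", "output.txt", unidecode)`. Writing to a separate output file keeps the original intact in case the result is not what you expected.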
TDT