Converting accented characters into latin without compromising ElementTree

Question

I'm trying to replace all accented characters (å, é, í, ...) with their unaccented Latin counterparts (a, e, i respectively). I've tried several approaches, but each one does something I don't understand that makes it impossible for ElementTree to parse the result with .fromstring().

I also have to escape ampersand characters, but that I have figured out.

Relevant code:

# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
import os
import re

path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()

    for line in lines:
        line = re.sub(r"&(?!#\d{3};|amp;)", "&amp;", line)
        xmlVal = ET.fromstring(line)

The error occurs on this last line. With the other solutions I tried, it raised UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 161: ordinal not in range(128), or a similar error.

Note that [this answer](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) from the linked question shows how to do this "by hand", using the standard `unicodedata` module. — PM 2Ring, May 16 '18 at 13:06
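The "by hand" approach mentioned in the comment above can be sketched roughly as follows. This is a minimal Python 3 sketch, not the linked answer's exact code: `strip_accents` is a hypothetical helper name, and the sample `line` is made up for illustration. It decomposes each character with `unicodedata.normalize` and drops the combining marks, so the string stays Unicode text that `ET.fromstring()` can parse.

```python
import unicodedata
import xml.etree.ElementTree as ET


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks,
    # e.g. "é" becomes "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up sample line; the real input would come from the audit files.
line = '<entry user="Ärlig">café &amp; crème</entry>'
element = ET.fromstring(strip_accents(line))
print(element.get("user"), element.text)
```

Letters that have no decomposition, such as ø or ß, pass through unchanged with this approach, so it covers fewer cases than a transliteration table would.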

Rakesh · Answer 1 · 2018-05-16T13:00:24.003

Try using the third-party unidecode module.

Example:

import xml.etree.ElementTree as ET
import os
import re
import unidecode


path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()

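    for line in lines:
        # Escape bare ampersands first, as in the question's code
        line = re.sub(r"&(?!#\d{3};|amp;)", "&amp;", line)
        # Transliterate accented characters to plain ASCII before parsing
        line = unidecode.unidecode(line)
        xmlVal = ET.fromstring(line)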
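For Python 3, the same idea could be written along the following lines. This is only a sketch under stated assumptions: the files are assumed to be UTF-8 encoded with one XML element per line, `parse_line` and `parse_file` are hypothetical helper names, and the ampersand pattern is a broader variant of the one in the question. If unidecode is not installed, the sketch falls back to a simpler `unicodedata`-based replacement that only strips combining marks.

```python
import io
import re
import xml.etree.ElementTree as ET

try:
    from unidecode import unidecode  # third-party package
except ImportError:
    import unicodedata

    def unidecode(text):
        # Fallback (assumption): strip combining marks instead of transliterating.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Matches "&" unless it already starts a numeric or predefined XML entity.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9A-Fa-f]+;|amp;|lt;|gt;|quot;|apos;)")


def parse_line(line):
    """Escape bare ampersands, remove accents, then parse one XML element."""
    line = BARE_AMPERSAND.sub("&amp;", line)
    return ET.fromstring(unidecode(line))


def parse_file(filepath, encoding="utf-8"):
    """Parse every non-empty line of a file, decoding it explicitly."""
    # Decoding on read means every line is Unicode text, not raw bytes.
    with io.open(filepath, encoding=encoding) as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

Opening the file with an explicit encoding avoids relying on the platform default, which is one common source of encode and decode errors with non-ASCII input.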
