I'm trying to figure out how to replace all accented characters (å, é, í, ...) with their unaccented Latin counterparts (a, e, i respectively). I've tried several ways of doing this, but they all do something I don't understand that makes it impossible for ElementTree to parse the result later with .fromstring().
I also have to escape ampersand characters, but that I have figured out.
Relevant code:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import os
import re

path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"
root = ET.Element("root")

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()
    for line in lines:
        # Escape bare ampersands, leaving &#NNN; and &amp; untouched
        line = re.sub(r"&(?!#\d{3};|amp;)", "&amp;", line)
        # Each line is expected to be a standalone XML fragment
        xmlVal = ET.fromstring(line)
The error occurs on this last line. With the other solutions I tried, it complained with UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 161: ordinal not in range(128), or a similar error.
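For illustration, here is a minimal sketch of the kind of accent stripping described above, using Unicode decomposition from the standard unicodedata module. It assumes the text is already a Unicode string (on Python 2 that means decoding the file contents first, and the actual encoding of the audit files is not known here). The helper name strip_accents and the sample line are hypothetical and not taken from the real data.

```python
# -*- coding: utf-8 -*-
import unicodedata
import xml.etree.ElementTree as ET


def strip_accents(text):
    """Return text with combining marks removed, e.g. u'\u00e9' -> u'e'.

    Hypothetical helper; expects a Unicode string, not raw bytes.
    """
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample line, not taken from the real audit files.
sample = u'<entry name="\u00c5sa Andr\u00e9">caf\u00e9</entry>'

# After stripping, the line is plain ASCII, so parsing should not hit
# the 'ascii' codec error quoted above.
element = ET.fromstring(strip_accents(sample))
```

Note that this only covers letters that decompose into a base letter plus a combining mark. Characters such as ø or ł have no such decomposition and would pass through unchanged, so they would need an explicit mapping.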