I'm trying to replace all accented characters (å, é, í, ...) with their unaccented Latin counterparts (a, e, i respectively). I've tried several approaches, but each one does something I don't understand that makes it impossible for ElementTree to parse the result later with .fromstring().

I also have to escape ampersand characters, but I have figured that part out.

Relevant code:

# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET
import os
import re

path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

for filename in os.listdir(path):
    with open(path + "\\" + filename) as myfile:
        lines = myfile.readlines()

    for line in lines:
        line = re.sub(r"&(?!#\d{3};|amp;)", "&amp;", line)
        xmlVal = ET.fromstring(line)

The error occurs on this last line. With the other solutions I tried, it complained with UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 161: ordinal not in range(128), or a similar error.

akshat
Lycan
    Note that [this answer](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) from the linked question shows how to do this "by hand", using the standard `unicodedata` module. – PM 2Ring May 16 '18 at 13:06
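
For reference, the "by hand" approach mentioned in the comment above can be sketched roughly as follows. This is a minimal Python 3 sketch, not the linked answer's exact code: `strip_accents` is a hypothetical helper name, and it only relies on `unicodedata.normalize` and `unicodedata.combining` from the standard library.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFKD splits a character such as "é" into "e" + U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("å, é, í, Ä"))  # a, e, i, A
```

Letters that have no Unicode decomposition, such as "ø", pass through unchanged with this approach, so it will not always produce pure ASCII.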

1 Answer

Try the third-party unidecode module, which transliterates Unicode text to its closest ASCII representation.

Example:

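import io
import os
import xml.etree.ElementTree as ET

import unidecode  # third-party package: pip install unidecode


path = "C:\\Users\\SuperUser\\Desktop\\audit\\audit\\saved\\audit"

root = ET.Element("root")

for filename in os.listdir(path):
    # Decode the file to unicode text; UTF-8 is an assumption about the audit files.
    with io.open(os.path.join(path, filename), encoding="utf-8") as myfile:
        lines = myfile.readlines()

    for line in lines:
        # Transliterate to plain ASCII (e.g. "é" -> "e") before parsing.
        line = unidecode.unidecode(line)
        xmlVal = ET.fromstring(line)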
Rakesh
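
To try the whole idea without touching the audit files, here is a small self-contained Python 3 sketch that escapes bare ampersands, strips accents, and then parses a single line. It reuses the ampersand regex from the question, but substitutes the standard `unicodedata` module for unidecode so it runs without third-party packages. The helper name `clean_line` and the sample XML line are made up for illustration.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Matches "&" unless it already begins "&amp;" or a three-digit "&#NNN;" reference.
BARE_AMPERSAND = re.compile(r"&(?!#\d{3};|amp;)")


def clean_line(line):
    """Escape bare ampersands and drop accents (hypothetical helper)."""
    line = BARE_AMPERSAND.sub("&amp;", line)
    # Decompose accented letters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", line)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up sample line; real audit lines may look different.
sample = '<event user="Åsa Lindén" note="R&D"/>'
element = ET.fromstring(clean_line(sample))
print(element.get("user"))  # Asa Linden
print(element.get("note"))  # R&D
```

Escaping ampersands before stripping accents keeps the two steps independent, since neither one changes the characters the other looks for.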