I wrote a script reading XML files using minidom:

from xml.dom.minidom import parse
for File in Data['FileList']:
    Xml = parse(File)
    # do something

which runs fine, but some people are creating XML files that declare UTF-8 encoding in the XML declaration while using German umlauts in tags, so I ran into xml.parsers.expat.ExpatError: not well-formed (invalid token).

If I manually change the declaration in the XML to encoding="ISO-8859-1", it runs fine.

Is there a more elegant way of changing the encoding, instead of editing the XML files, e.g. telling minidom to use a different encoding than defined in the XML?

Billal Begueradj
  • It is a serious error to create XML files with an XML declaration saying `encoding="UTF-8"` when the actual encoding is ISO-8859-1. I think you should tell the "guys" to stop creating these bad XML files. – mzjn Jun 05 '18 at 18:28

1 Answer

I suggest this solution:

Before parsing the file, open it normally and replace its first line, which is the XML declaration, with this line:

<?xml version="1.0" encoding="ISO-8859-1"?>

You then save the file and pass it to the minidom.parse() function.

This may help you replace the first line in each file: Search and replace a line in a file in Python
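If rewriting the files on disk is undesirable, the same idea can be applied in memory. The following is only a minimal sketch: it assumes the files really are ISO-8859-1 but declare encoding="UTF-8", and the helper name parse_mislabeled and its actual_encoding parameter are made up for illustration. Instead of changing the declaration, it re-encodes the content as UTF-8 so the bytes match what the declaration already claims, then hands them to minidom.parseString().

```python
from xml.dom.minidom import parseString


def parse_mislabeled(path, actual_encoding="iso-8859-1"):
    """Parse an XML file that declares UTF-8 but is stored in another encoding.

    Hypothetical helper: `actual_encoding` is an assumption about how the
    files were really written, not something minidom can detect.
    """
    # Read the raw bytes so Python does not guess an encoding for us.
    with open(path, "rb") as handle:
        raw = handle.read()

    # Decode with the encoding the file actually uses ...
    text = raw.decode(actual_encoding)

    # ... and re-encode as UTF-8, which is what the XML declaration says.
    return parseString(text.encode("utf-8"))
```

In the loop from the question, Xml = parse(File) would become Xml = parse_mislabeled(File). Note that this would garble files that are genuinely UTF-8, so it only fits if all inputs share the same mislabeling.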

Billal Begueradj
  • This is what I wanted to avoid, but it seems to be the only robust way to read the files. So there seems to be no option in minidom to specify a different encoding when parsing a file. –  Jun 07 '18 at 05:52
  • I did not mean opening the file manually, but programmatically (something like this: `with open(File, 'r+') as my_file: # replace and save, then pass the file to parse()`) – Billal Begueradj Jun 07 '18 at 06:03