Reading XML with a different encoding using Python minidom

Question

I wrote a script reading XML files using minidom:

from xml.dom.minidom import parse

for File in Data['FileList']:
    Xml = parse(File)
    # do something

This runs fine, but some people create XML files that declare UTF-8 encoding in the XML declaration and use German umlauts in tags, so I ran into xml.parsers.expat.ExpatError: not well-formed (invalid token).

If I manually change the declaration in the XML to encoding="ISO-8859-1", it runs fine.

Is there a more elegant way of changing the encoding, instead of editing the XML files, e.g. telling minidom to use a different encoding than defined in the XML?

It is a serious error to create XML files with an XML declaration saying `encoding="UTF-8"` when the actual encoding is ISO-8859-1. I think you should tell the "guys" to stop creating these bad XML files. — mzjn, Jun 05 '18 at 18:28

Billal Begueradj · Accepted Answer · 2018-06-05T17:03:31.397


I suggest this solution:

Before parsing the file, open it normally and replace its first line, which is the XML declaration, with this line:

<?xml version="1.0" encoding="ISO-8859-1"?>

You then save the file and pass it to the minidom.parse() function.

This may help you replace the first line in each file: Search and replace a line in a file in Python
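The steps above could be automated roughly as follows. This is only a sketch: the helper names `rewrite_declared_encoding` and `parse_with_fixed_declaration` are made up for illustration, and it assumes every affected file starts with an XML declaration and really contains ISO-8859-1 bytes. Instead of replacing the whole first line, it swaps only the `encoding` value, so the rest of the declaration is kept.

```python
import re
from xml.dom.minidom import parse

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the file, e.g. <?xml version="1.0" encoding="UTF-8"?>
_DECL_ENCODING = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def rewrite_declared_encoding(path, encoding="ISO-8859-1"):
    """Overwrite the encoding named in the XML declaration of the file.

    Hypothetical helper: the file is left untouched if no declaration
    with an encoding is found or if it already names `encoding`.
    """
    with open(path, "rb") as f:
        data = f.read()
    fixed = _DECL_ENCODING.sub(
        lambda m: m.group(1) + m.group(2) + encoding.encode("ascii") + m.group(2),
        data,
        count=1,
    )
    if fixed != data:
        with open(path, "wb") as f:
            f.write(fixed)


def parse_with_fixed_declaration(path, encoding="ISO-8859-1"):
    """Fix the declaration on disk, then parse the file with minidom."""
    rewrite_declared_encoding(path, encoding)
    return parse(path)
```

In the loop from the question, `Xml = parse(File)` would then become `Xml = parse_with_fixed_declaration(File)`. Note that this modifies the source files, so it is worth keeping backups.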

edited Jun 05 '18 at 17:03

answered Jun 05 '18 at 15:15


This is what I want to avoid, but it seems to be the only robust way to read the files in. So there seems to be no option in minidom to specify a different encoding while parsing the file. – Jun 07 '18 at 05:52
I did not mean opening the file manually, but programmatically (something like this: `with open(File, 'r+') as my_file: # replace and save, then pass the file to parse()`) – Billal Begueradj Jun 07 '18 at 06:03
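If the files on disk should stay untouched, a variation on the same idea is to fix the declaration in memory and hand the bytes to `minidom.parseString()`. The sketch below is one possible way to do it, not an option built into minidom; the function name `parse_as` is hypothetical, and it assumes the file begins with an XML declaration that names an encoding.

```python
import re
from xml.dom.minidom import parseString


def parse_as(path, encoding="ISO-8859-1"):
    """Parse the file as if its declaration named `encoding`.

    Hypothetical helper: only the in-memory copy of the bytes is changed,
    the file itself is not written to.
    """
    with open(path, "rb") as f:
        data = f.read()
    # Replace just the encoding value inside the leading XML declaration.
    data = re.sub(
        rb'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        lambda m: m.group(1) + m.group(2) + encoding.encode("ascii") + m.group(2),
        data,
        count=1,
    )
    return parseString(data)
```

With this, the loop from the question would call `Xml = parse_as(File)`. Files whose declaration is already correct for their content would need to be excluded or handled separately, since the override is applied unconditionally.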
