This question appears related to this one from 2013, but the answers there didn't help me.

I'm about to parse a large (2 GB) XML file with Python 3.5.2 and ElementTree. I'm new to Python. Parsing works well until it reaches a named character entity that the file does not declare, such as:

<author>Sanjeev Sax&ouml;na</author>

which raises:

test.xml
  File "<string>", line unknown
ParseError: undefined entity &ouml;: line 5, column 19

My code looks something like this:

import xml.etree.ElementTree as etree
for event, elem in etree.iterparse('test_esc.xml'):
  # do something with the node
  pass

What's the best way to deal with this? Parsing the unescaped 'ö' actually works fine:

<author>Sanjeev Saxöna</author>

Is there an easy way to programmatically unescape the whole XML file?

1 Answer

As suggested by the answer linked by Soulaimane Sahmi, I added an inline DTD to the XML file. The DTD declares the entities that the parser otherwise reports as undefined, so &ouml; now resolves to 'ö'. It may not be the best solution, but it works for now.
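For anyone landing here, this is a minimal sketch of what the inline DTD approach can look like. The root element name `records` and the `article` wrapper are assumptions made for illustration, since the question does not show the real file structure, and only `ouml` is declared here; a real file would need one `<!ENTITY ...>` line for every named entity it uses.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample document: the DOCTYPE block is the inline DTD.
# It declares &ouml; as the numeric character reference for 'ö' (U+00F6).
XML_WITH_DTD = """<?xml version="1.0"?>
<!DOCTYPE records [
  <!ENTITY ouml "&#246;">
]>
<records>
  <article>
    <author>Sanjeev Sax&ouml;na</author>
  </article>
</records>
"""


def read_authors(source):
    """Stream through the document and collect the text of each <author>."""
    authors = []
    for event, elem in etree.iterparse(source):
        if elem.tag == 'author':
            authors.append(elem.text)
        # Release the element's contents once its end tag has been seen,
        # which keeps memory use down on large files.
        elem.clear()
    return authors


# iterparse accepts a file name or a file object; BytesIO stands in for the file.
authors = read_authors(io.BytesIO(XML_WITH_DTD.encode('utf-8')))
print(authors)
```

In a real file the DOCTYPE block goes directly after the XML declaration and before the root element, and by convention it carries the root element's name.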