I need to parse a 1.2GB XML file encoded as "ISO-8859-1". After reading a few articles online, it seems that iterparse() from Python's ElementTree is preferred over SAX parsing for a file this size.
I've written an extremely short piece of code just to test it out, but it raises an error that I have no idea how to solve.
My Code (Python 2.7):
from xml.etree.ElementTree import iterparse

for (event, node) in iterparse('dblp.xml', events=['start']):
    print node.tag
    node.clear()
Edit: The file is really big and slow to work with, so I typed out the XML line by hand and made a mistake. The entity is "&uuml;", with no space after the ampersand. I apologize for this.
This code works fine until it hits a line in the XML file that looks like this:
<Journal>Technical Report 248, ETH Z&uuml;rich, Dept of Computer Science</Journal>
which I guess means Zurich, but the parser does not seem to know this.
Running the code above gave me an error:
xml.etree.ElementTree.ParseError: undefined entity &uuml;
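In case it helps, here is a minimal sketch that should reproduce the problem without the 1.2GB file. The sample document is made up for illustration and is not copied from dblp.xml; in particular, the DOCTYPE line is my assumption about what the real header looks like, and `first_parse_error` is just a helper name I picked. It reads from an in-memory buffer instead of 'dblp.xml'.

```python
import io
from xml.etree.ElementTree import iterparse, ParseError

# Made-up sample, not taken from dblp.xml.
# Assumption: the real file has a DOCTYPE pointing at an external DTD.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b'<dblp><Journal>ETH Z&uuml;rich</Journal></dblp>\n'
)


def first_parse_error(xml_bytes):
    """Iterate over the document and return the ParseError message, or None."""
    try:
        for event, node in iterparse(io.BytesIO(xml_bytes), events=['start']):
            pass  # only interested in whether parsing gets through
    except ParseError as exc:
        return str(exc)
    return None


print(first_parse_error(SAMPLE))
```

If the entity reference is removed from the sample, the helper should return None, so the entity seems to be the only thing the parser objects to.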
Is there any way I could solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.