I'm trying to parse an XML file that's over 2 GB with Python's lxml library. Unfortunately, the file has no XML declaration stating its character encoding, so I have to set it manually. Even so, some strange characters still come up once in a while as I iterate through the file.
I'm not sure how to determine the character encoding of the offending line. On top of that, lxml raises an XMLSyntaxError from the for statement itself rather than from the loop body. How can I properly catch this error and deal with it correctly? Here's a simplified code snippet:
from lxml import etree
etparse = etree.iterparse(file("my_file.xml", 'r'), events=("start",), encoding="CP1252")
for event, elem in etparse:
    if elem.tag == "product":
        print "Found the product!"
        elem.clear()
This eventually produces the error:
XMLSyntaxError: PCDATA invalid Char value 31, line 1565367, column 50
That line of the file looks like this:
% sed -n "1565367 p" my_file.xml
<romance_copy>Ravioli Florentine. Tender Ravioli Filled With Creamy Ricotta Cheese And
In my terminal, the 'F' of "Filled" on that line is displayed differently from a normal 'F'.
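One workaround I'm considering, as a rough sketch only: it assumes that value 31 in the error message is the ASCII unit separator (0x1F), one of the control characters XML 1.0 does not allow, and that the file really is in a single-byte encoding such as CP1252. `ControlCharFilter` and `ILLEGAL_XML_BYTES` are hypothetical names of my own, not part of lxml, and the sample bytes are a made-up fragment modelled on the line above.

```python
import io
import re

# Bytes that XML 1.0 forbids as characters: every C0 control except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
ILLEGAL_XML_BYTES = re.compile(b"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class ControlCharFilter(object):
    """Hypothetical file-like wrapper that drops illegal control bytes.

    Assumes a single-byte encoding such as CP1252, where each of these
    byte values can only stand for the control character itself.
    """

    def __init__(self, raw):
        self.raw = raw  # underlying file object, opened in binary mode

    def read(self, size=-1):
        while True:
            chunk = self.raw.read(size)
            cleaned = ILLEGAL_XML_BYTES.sub(b"", chunk)
            # Return an empty result only at the real end of file, so a
            # chunk made up entirely of illegal bytes is not mistaken
            # for EOF by the caller.
            if cleaned or not chunk:
                return cleaned


# Made-up sample: a unit separator (0x1F) placed right before the 'F'.
sample = b"<romance_copy>Ravioli \x1fFilled</romance_copy>"
cleaned = ControlCharFilter(io.BytesIO(sample)).read()
```

An instance wrapping the file opened in binary mode could be handed to `etree.iterparse` in place of the raw file object, since `iterparse` accepts file-like objects that provide `read`. As for catching the error: because the exception is raised by the iterator while the `for` statement fetches the next event, a `try`/`except etree.XMLSyntaxError` would have to enclose the whole loop, not just its body.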