I'm using lxml's iterparse to parse some big XML files (3-5 GB). Since some of these files contain invalid characters, an lxml.etree.XMLSyntaxError is raised.
When using lxml.etree.parse, I can provide a parser that recovers from invalid characters:
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)
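To make the behaviour concrete, here is a minimal, self-contained sketch of the difference between the default parser and a recovering one. The in-memory sample document and the helper names parse_strict and parse_recovering are made up for illustration; only XMLParser(recover=True) is taken from my actual code. How much of the damaged content survives recovery seems to depend on the underlying libxml2 version, so the sketch only checks that a tree comes back at all.

```python
from lxml import etree

# Hypothetical sample: \x08 (backspace) is not a legal XML 1.0 character.
BAD_XML = b"<root><Foo>a\x08b</Foo><Foo>ok</Foo></root>"


def parse_strict(data):
    """Parse with the default parser; raises XMLSyntaxError on invalid characters."""
    return etree.fromstring(data)


def parse_recovering(data):
    """Parse with recover=True so the parser tries to continue past errors."""
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(data, parser)


# The default parser rejects the document outright.
try:
    parse_strict(BAD_XML)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# The recovering parser returns a (possibly incomplete) tree instead.
root = parse_recovering(BAD_XML)
print(strict_failed, root.tag)
```

This is fine for small inputs, but it builds the whole tree in memory, which is exactly what I want to avoid for multi-gigabyte files.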
Is there a way to get the same functionality for iterparse?
Edit: Encoding is not an issue here. These XML files contain invalid characters, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't pass a custom parser. So I'm looking for the functionality from my snippet above, applied to this call:
context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover