I'm using lxml's iterparse to parse some big XML files (3-5 GB). Some of these files contain invalid characters, so an lxml.etree.XMLSyntaxError is thrown.

When using lxml.etree.parse I can provide a parser which recovers on invalid characters:

import lxml.etree

parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)

Is there a way to get the same functionality for iterparse?

Edit: Encoding is not an issue here. There are invalid characters in these XML files, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't pass a custom parser. So I'm looking for the functionality provided in my snippet above, but for this call:

context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover
Jay
  • `iterparse()` has had the `recover` option since lxml 3.3.0. See https://stackoverflow.com/a/70492671/407651. – mzjn Dec 27 '21 at 08:53
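
Following the comment above, here is a minimal sketch of what passing `recover=True` to `iterparse()` might look like, assuming lxml 3.3.0 or later. The sample bytes and the helper name `iter_foo_texts` are made up for illustration; how much of a damaged element survives recovery depends on the underlying libxml2 version, so treat the recovered text as best-effort rather than guaranteed.

```python
import io

from lxml import etree

# Hypothetical sample data: the second <Foo> contains a raw 0x01 byte,
# which is not a legal character in XML 1.0.
SAMPLE = b"<root><Foo>ok</Foo><Foo>bad\x01byte</Foo><Foo>after</Foo></root>"


def iter_foo_texts(source):
    """Yield the text of each <Foo> element, recovering from parse errors.

    `source` is a filename or a binary file-like object.
    """
    context = etree.iterparse(source, events=("end",), tag="Foo", recover=True)
    for _event, elem in context:
        yield elem.text
        # Free the element's content so memory stays bounded on large files.
        elem.clear()


if __name__ == "__main__":
    for text in iter_foo_texts(io.BytesIO(SAMPLE)):
        print(repr(text))
```

For a multi-gigabyte file you would pass the filename (or a file opened in binary mode) instead of the `BytesIO` object. Clearing each element after use is what keeps memory low; `recover=True` only controls whether the parser keeps going after an error.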

1 Answer

When you say invalid characters, do you mean Unicode characters? If so, you can try:

lxml.etree.XMLParser(encoding='UTF-8', recover=True)

If you mean malformed XML, then this won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError, which will provide more information.

danodonovan
  • Thanks for your answer. No, I mean invalid bytes in my XML. This has nothing to do with unicode. The snippet I provided is running without errors, but since etree.parse loads the DOM into RAM this can't be used for extremely large files. I'm looking for the same functionality for iterparse. – Jay Feb 18 '13 at 13:01