Is there a way to recover iterparse on invalid Char values?

Question

I'm using lxml's iterparse to parse some large XML files (3-5 GB). Some of these files contain invalid characters, so an lxml.etree.XMLSyntaxError is thrown.

When using lxml.etree.parse I can provide a parser which recovers on invalid characters:

import lxml.etree
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)

Is there a way to get the same functionality for iterparse?

Edit: Encoding is not the issue here. These XML files contain invalid characters, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse, I can't pass a custom parser. So I'm looking for the functionality shown in my snippet above, but for this call:

context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover

`iterparse()` has had the `recover` option since lxml 3.3.0. See https://stackoverflow.com/a/70492671/407651. — mzjn, Dec 27 '21 at 08:53
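Following the comment above, here is a minimal sketch of what passing `recover=True` straight to `iterparse()` might look like. The helper name `iter_foo_texts` and the in-memory `SAMPLE` document are made up for illustration. How much content survives after an invalid character depends on the underlying libxml2 version, so treat this as a starting point rather than a guarantee.

```python
from io import BytesIO

from lxml import etree


def iter_foo_texts(source, tag="Foo"):
    """Yield the text of each matching element (hypothetical helper)."""
    # recover=True asks the parser to continue past errors instead of
    # raising XMLSyntaxError (available in iterparse since lxml 3.3.0).
    context = etree.iterparse(source, events=("end",), tag=tag, recover=True)
    for _event, elem in context:
        yield elem.text
        # Drop the element's text and children once it has been handled,
        # which helps keep memory use down on multi-gigabyte files.
        elem.clear()


# Small in-memory stand-in for a large file (assumption for this sketch).
# The byte \x0b (vertical tab) is not a legal XML 1.0 character.
SAMPLE = b"<root><Foo>ok</Foo><Foo>bad\x0bchar</Foo><Foo>after</Foo></root>"

texts = list(iter_foo_texts(BytesIO(SAMPLE)))
```

For a real file, the same helper could be called with a file opened in binary mode, for example `iter_foo_texts(open("myMalformed.xml", "rb"))`.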

score 0 · Answer 1 · answered Feb 18 '13 at 12:46

When you say invalid characters, do you mean Unicode characters? If so, you can try:

lxml.etree.XMLParser(encoding='UTF-8', recover=True)

If you mean malformed XML then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError which will provide more information.
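If it helps to capture the error text before posting it, something along these lines could work. This is only a sketch: `describe_parse_error` is a hypothetical helper name, and the tiny in-memory document stands in for the real file.

```python
from io import BytesIO

from lxml import etree


def describe_parse_error(source):
    """Return the XMLSyntaxError message for source, or None if it parses."""
    try:
        # Strict parsing (no recover): iterate just to drive the parser.
        for _event, _elem in etree.iterparse(source, events=("end",)):
            pass
    except etree.XMLSyntaxError as exc:
        return str(exc)
    return None


# Stand-in input containing \x0b, which is not a legal XML 1.0 character.
message = describe_parse_error(BytesIO(b"<root><Foo>bad\x0bchar</Foo></root>"))
print(message)
```

The message normally names the kind of problem and where it occurred, which is usually enough to tell an encoding issue apart from an illegal character.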

answered Feb 18 '13 at 12:46

danodonovan

Thanks for your answer. No, I mean invalid bytes in my XML. This has nothing to do with unicode. The snippet I provided is running without errors, but since etree.parse loads the DOM into RAM this can't be used for extremely large files. I'm looking for the same functionality for iterparse. – Jay Feb 18 '13 at 13:01
