I have an XML file with invalid characters. lxml's XMLParser raises an exception on these characters, but when I create the XMLParser with the recover=True option, it ignores the bad characters and parsing succeeds.

My question is: how can I set a similar flag for lxml's iterparse function?

Reproduction:

The broken XML (/tmp/z.xml):

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <B>Bad characters:</B>
    </item>
</items>

NOTE: Two ASCII control characters with code 31 (0x1F) follow the "Bad characters:" string; I could not copy-paste them here.

The parsing error of XMLParser:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()
tree   = etree.parse(fd, parser)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
  File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
  File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
  File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
  File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21

To ignore the bad characters I set recover=True and it works OK:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree   = etree.parse(fd, parser)
etree.tostring(tree)

# OUTPUT:
'<items>\n\t<item>\n\t\t<B>Bad characters:</B>\n\t</item>\n</items>'

With iterparse I get the same error again. How can I make it ignore the bad characters?

fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it: print(e)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
Cœur
diemacht
  • http://comments.gmane.org/gmane.comp.python.lxml.devel/6235 read the last post. it seems that this is currently not possible with iterparse. – pypat May 06 '13 at 10:23
  • Looks like a duplicate of http://stackoverflow.com/q/14934854/407651, but that question has no accepted or upvoted answer. I suspect that the answer is what @pypat says. – mzjn May 07 '13 at 08:17
  • @mzjn: Right, seems like it's the same question. – diemacht May 07 '13 at 15:36

1 Answer

iterparse also accepts the recover argument (added in lxml 3.3.0):

it = etree.iterparse(fd, events=("start", "end"), recover=True)

(Documentation: lxml iterparse)
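
For a quick check without touching /tmp/z.xml, here is a minimal sketch that feeds the same document from memory. It assumes lxml 3.3.0 or later; `BROKEN_XML` and the helper `collect_events` are names made up for this illustration, and the two `\x1f` bytes stand in for the control characters described in the question.

```python
import io

from lxml import etree

# In-memory copy of the question's document; the two \x1f bytes are the
# invalid control characters (assumed placement: right after the text).
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<items>\n"
    b"    <item>\n"
    b"        <B>Bad characters:\x1f\x1f</B>\n"
    b"    </item>\n"
    b"</items>\n"
)


def collect_events(data, recover):
    """Hypothetical helper: run iterparse over bytes, return (event, tag) pairs."""
    source = io.BytesIO(data)
    it = etree.iterparse(source, events=("start", "end"), recover=recover)
    return [(event, elem.tag) for event, elem in it]


# With recover=True the iterator should not raise on the 0x1F characters.
events = collect_events(BROKEN_XML, recover=True)
for event, tag in events:
    print(event, tag)
```

Calling `collect_events(BROKEN_XML, recover=False)` should reproduce the XMLSyntaxError from the question. How much of the document survives after the bad characters depends on the underlying libxml2 recovery behaviour, so it is worth checking the events you get against what you expect.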

ellockie
  • The `recover` option was added to `iterparse()` in lxml 3.3.0: https://lxml.de/3.3/changes-3.3.0.html. – mzjn Dec 27 '21 at 08:38