
Does anyone have any tips on how to use lxml.objectify with recover=True?

I have XML where the attributes are not quoted: name=value instead of name='value'.

Below is some sample code. I do not have control over the XML formatting, so I cannot go back and have it changed. Parsing with etree does work.

The error is

File "<string>", line unknown
XMLSyntaxError: AttValue: " or ' expected, line 4, column 21

lxml.objectify CODE -- FAILS

xmlSample="""<dict>
<maptable>
  <hdterm displevel=1 autlookup entrytype=1>Source term</hdterm>
</maptable>
</dict>"""

If I don't get an answer do I have to re

from lxml import objectify

#p = objectify.XMLParser(recover=True)

# raises XMLSyntaxError because of the unquoted attributes
root = objectify.fromstring(xmlSample)

# returns attributes in element node as dict
attrib = root.attrib

# how to extract element data
tbl = root.maptable

print("root.maptable type=%s" % type(tbl))

lxml.etree - WORKS!

from lxml import etree, objectify

import io
xmlIO = io.StringIO(xmlSample)

p = etree.XMLParser(recover=True)

tree = etree.parse(xmlIO, parser=p)
root = tree.getroot()
print(root.tag)

OUTPUT:

myxml
har07
frankr6591

1 Answer

UPDATE :

It turns out you can pass the recover=True option to objectify.makeparser() to create a parser that tries to recover from a malformed XML document. You can then pass that parser to objectify.fromstring(), like so:

from lxml import etree, objectify

xmlSample="""<dict>
<maptable>
  <hdterm displevel=1 autlookup entrytype=1>Source term</hdterm>
</maptable>
</dict>"""

parser = objectify.makeparser(recover=True)
root = objectify.fromstring(xmlSample, parser)

print(type(root.maptable.hdterm))
# output :
# <type 'lxml.objectify.StringElement'>
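
Since recover=True silently repairs or drops whatever it cannot parse, it can help to look at what the parser complained about. The following is only a sketch: it assumes the parser's error_log property is filled in after a recovering parse, and the name recovered_errors is made up for illustration. It deliberately prints the surviving attributes instead of assuming the unquoted ones were kept.

```python
from lxml import objectify

xml_sample = """<dict>
<maptable>
  <hdterm displevel=1 autlookup entrytype=1>Source term</hdterm>
</maptable>
</dict>"""

parser = objectify.makeparser(recover=True)
root = objectify.fromstring(xml_sample, parser)

# error_log lists the problems reported during the last parser run,
# which includes the ones the parser recovered from.
recovered_errors = [
    (entry.line, entry.column, entry.message) for entry in parser.error_log
]
for line, column, message in recovered_errors:
    print("line %d, column %d: %s" % (line, column, message))

# What survives recovery is decided by libxml2, so inspect the element
# rather than assuming the unquoted attributes are still there.
hdterm = root.maptable.hdterm
print(hdterm.text, dict(hdterm.attrib))
```

The line and column numbers in the log refer to the original input, which may help when tracing recovered data back to the source file.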

INITIAL ANSWER :

You can combine the two: use etree with recover=True to repair the broken XML input, then use objectify to parse the well-formed intermediate XML:

from lxml import etree, objectify

xmlSample="""your_xml_here"""

p = etree.XMLParser(recover=True)
well_formed_xml = etree.fromstring(xmlSample, p)
root = objectify.fromstring(etree.tostring(well_formed_xml))
har07
  • Thanks! I thought about this. The actual XML files are very large, and this approach would require processing them 3 times: xml->etree, etree->good_xml, good_xml->objects. Plus, tracing data back to the original XML could be problematic. – frankr6591 Mar 23 '16 at 11:10
  • I guess there is no objectify.XMLParser(recover=True)? Or a way to replace the objectify parser? – frankr6591 Mar 23 '16 at 11:11
  • @frankr6591 I think I found the way to pass the `recover=True` option to `objectify`! See the **UPDATE** section above – har07 Mar 23 '16 at 11:52
  • I accepted your answer above as it is the best answer. Yet, it did not resolve the root issue. I finally had to use etree.iterparse because the combination of recover=True *and* html=True was needed to be forgiving for the xml I was parsing. – frankr6591 Mar 29 '16 at 19:26
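
For reference, the etree.iterparse combination mentioned in the last comment might look roughly like the sketch below. It is not the asker's actual code: the sample data, the tag filter and the rows list are assumptions for illustration. With html=True the input is read by the HTML parser, which accepts unquoted attribute values, and iterating element by element avoids holding a very large file in memory.

```python
import io

from lxml import etree

xml_sample = b"""<dict>
<maptable>
  <hdterm displevel=1 autlookup entrytype=1>Source term</hdterm>
</maptable>
</dict>"""

rows = []  # hypothetical container for the extracted data

# html=True switches to the forgiving HTML parser; recover=True is
# passed as well, as described in the comment above.
events = etree.iterparse(
    io.BytesIO(xml_sample),
    events=("end",),
    tag="hdterm",
    html=True,
    recover=True,
)
for event, elem in events:
    # Copy the data out before clearing the element.
    rows.append((elem.text, dict(elem.attrib)))
    elem.clear()  # release the element's content to keep memory use low

for text, attrib in rows:
    print(text, attrib)
```

Note that the HTML parser wraps the content in html and body elements, so the resulting tree does not have dict as its root.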