Access the processing-instructions before/after a root element with lxml

Question

Using lxml, how can I access/iterate the processing-instructions located before the root open tag or after the root close tag?

I have tried this, but, according to the documentation, it only iterates inside the root element:

import io

from lxml import etree

content = """\
<?before1?>
<?before2?>
<root>text</root>
<?after1?>
<?after2?>
"""

source = etree.parse(io.StringIO(content))

print(etree.tostring(source, encoding="unicode"))
# -> <?before1?><?before2?><root>text</root><?after1?><?after2?>

for node in source.iter():
    print(type(node))
# -> <class 'lxml.etree._Element'>

My only solution is to wrap the XML with a dummy element:

dummy_content = "<dummy>{}</dummy>".format(etree.tostring(source, encoding="unicode"))
dummy = etree.parse(io.StringIO(dummy_content))

for node in dummy.iter():
    print(type(node))
# -> <class 'lxml.etree._Element'>
#    <class 'lxml.etree._ProcessingInstruction'>
#    <class 'lxml.etree._ProcessingInstruction'>
#    <class 'lxml.etree._Element'>
#    <class 'lxml.etree._ProcessingInstruction'>
#    <class 'lxml.etree._ProcessingInstruction'>

Is there a better solution?

I don't know why it happens, but FYI, if you `from lxml.html import fromstring` and then `source = fromstring(content)`, your `for` loop can access the root node as well as the two processing instructions after it, but not the two before. — Jack Fleeting, Jul 17 '19 at 19:06
You are certainly right @JackFleeting, but I don't want to use the HTML parser. I really want to work with XML. — Laurent LAPORTE, Jul 17 '19 at 19:09
I realize that - I was just wondering why the parser would distinguish between the "before" and "after". It's counterintuitive (at least to me). — Jack Fleeting, Jul 17 '19 at 19:16

mzjn · Accepted Answer · 2019-07-17T19:09:41.897


You can use the getprevious() and getnext() methods on the root element.

before2 = source.getroot().getprevious()
before1 = before2.getprevious()

after1 = source.getroot().getnext()
after2 = after1.getnext()

See https://lxml.de/api/lxml.etree._Element-class.html.
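
If the number of processing instructions is not known in advance, the same two methods can be called in a loop until they return None. This is only a minimal sketch: the helper names `siblings_before` and `siblings_after` are made up for illustration and are not part of lxml, and the `.target` lookup at the end assumes every top-level sibling is a processing instruction (a comment outside the root would not have that attribute).

```python
import io

from lxml import etree

content = """\
<?before1?>
<?before2?>
<root>text</root>
<?after1?>
<?after2?>
"""

root = etree.parse(io.StringIO(content)).getroot()


def siblings_before(element):
    """Hypothetical helper: nodes preceding `element`, in document order."""
    nodes = []
    node = element.getprevious()
    while node is not None:
        nodes.append(node)
        node = node.getprevious()
    nodes.reverse()  # getprevious() walks backwards, so restore document order
    return nodes


def siblings_after(element):
    """Hypothetical helper: nodes following `element`, in document order."""
    nodes = []
    node = element.getnext()
    while node is not None:
        nodes.append(node)
        node = node.getnext()
    return nodes


before = siblings_before(root)
after = siblings_after(root)

# Assumes all collected nodes are processing instructions (true for this input).
print([pi.target for pi in before])
print([pi.target for pi in after])
```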

Using XPath (on the ElementTree or Element instance) is also possible:

before = source.xpath("preceding-sibling::node()")  # List of two PIs
after = source.xpath("following-sibling::node()")
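
Note that `node()` matches any kind of sibling, so a comment outside the root element would be returned as well. If only processing instructions are wanted, the `processing-instruction()` node test from XPath 1.0 should narrow the result. A small sketch, calling `xpath()` on the root element; the comment in the input is not in the original document and was added here only to show the difference:

```python
import io

from lxml import etree

# Same document as in the question, plus one comment (added for illustration only).
content = """\
<?before1?>
<!-- a comment -->
<?before2?>
<root>text</root>
<?after1?>
<?after2?>
"""

root = etree.parse(io.StringIO(content)).getroot()

# node() matches processing instructions and comments alike.
all_before = root.xpath("preceding-sibling::node()")

# processing-instruction() keeps only the PIs.
pis_before = root.xpath("preceding-sibling::processing-instruction()")
pis_after = root.xpath("following-sibling::processing-instruction()")

print(len(all_before), len(pis_before))
print([pi.target for pi in pis_before])
print([pi.target for pi in pis_after])
```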

edited Jul 17 '19 at 19:09

answered Jul 17 '19 at 18:51


Clever and simple. I missed those methods in the doc, searching in _ElementTree instead of _Element. Thanks. – Laurent LAPORTE Jul 17 '19 at 18:59
With the XPath solution, the PIs are returned in document order, which is better for me. Good! – Laurent LAPORTE Jul 17 '19 at 19:12
