0

I have an XML file:

<?xml version="1.0" encoding="UTF-8"?>
<?foo class="abc" options="bar,baz"?>
<document>
 ...
</document>

and I'm interested in the processing instruction foo and its attributes.

I can use ET.iterparse for reading the PI, but it escapes me how to access the attributes as a dictionary – .attrib only gives an empty dict.

import xml.etree.ElementTree as ET

for _, elem in ET.iterparse("data.xml", events=("pi",)):
    print(repr(elem.tag))
    print(repr(elem.text))
    print(elem.attrib)

Output:

<function ProcessingInstruction at 0x7f848f2f7ba0>
'foo class="abc" options="bar,baz"'
{}

Any hints?

Nico Schlömer

3 Answers

1

While the contents of the PI look rather like attributes, this is just a convention that the author of this document has adopted. It is not defined by the XML spec, and therefore it is not supported in data models like DOM and XDM. Such name="value" pairs are sometimes called "pseudo-attributes".

You'll either have to parse them yourself by hand, or find a library that does it for you. Saxon has an XPath extension function saxon:get-pseudo-attribute(); other libraries may have something similar.
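Parsing by hand can be fairly short. The following is a minimal sketch, assuming the PI content consists only of name="value" or name='value' pairs; the helper name parse_pseudo_attributes and the regular expression are illustrative, not part of any library, and entity or character references inside the values are left untouched.

```python
import re

# Matches name="value" or name='value'. A value may not contain its own quote character.
_PSEUDO_ATTR = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_text):
    """Return the pseudo-attributes found in pi_text as a dict (hypothetical helper)."""
    attrs = {}
    for name, double_quoted, single_quoted in _PSEUDO_ATTR.findall(pi_text):
        # Only one of the two value groups takes part in a match; the other is ''.
        attrs[name] = double_quoted or single_quoted
    return attrs


print(parse_pseudo_attributes('class="abc" options="bar,baz"'))
# {'class': 'abc', 'options': 'bar,baz'}
```

Because a leading word without "=" never matches the pattern, the same function also works on text that still begins with the PI target, such as 'foo class="abc" options="bar,baz"'.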

Michael Kay
1

Use Python's lxml module to read the PI content, build an element string from it, and parse that:

>>> from lxml import etree
>>> tree = etree.parse("tmp.xml")
>>> pi = tree.xpath('//processing-instruction("foo")')
>>> pi[0].text
'class="abc" options="bar,baz"'
>>> root = etree.fromstring(f"<root {pi[0].text}/>")
>>> root.get('options')
'bar,baz'

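Note: by default, ElementTree skips processing instructions when it builds a tree. Since Python 3.8 they can be read with the "pi" event of iterparse.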

LMC
-2

The string content of a processing instruction can theoretically be anything. In many cases, though, it looks like the inside of an XML start tag: a name followed by attributes. To parse it, one can wrap the text in "<" and "/>" to form an empty element and parse that, e.g.:

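import xml.etree.ElementTree as ET

# The "pi" event requires Python 3.8 or later.
for _, elem in ET.iterparse("data.xml", events=("pi",)):
    # elem.text holds the target followed by the content, e.g. 'foo class="abc"',
    # so wrapping it in "<" and "/>" yields an empty element that can be parsed.
    _elem = ET.fromstring(f"<{elem.text}/>")
    print(_elem.tag)     # the PI target, e.g. foo
    print(_elem.attrib)  # the pseudo-attributes as a dict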
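Since the content can be anything, a PI that is not attribute-like makes ET.fromstring raise ParseError. Here is a sketch of a more defensive variant, assuming such PIs should simply be skipped; the helper name pi_attributes and the in-memory sample document (including the extra bar PI) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def pi_attributes(source):
    """Yield (target, attrib) for each attribute-like PI in source (hypothetical helper)."""
    # The "pi" event requires Python 3.8 or later.
    for _, pi in ET.iterparse(source, events=("pi",)):
        try:
            # pi.text is the target followed by the content, so this forms
            # an empty element whenever the content looks like attributes.
            fake = ET.fromstring(f"<{pi.text}/>")
        except ET.ParseError:
            continue  # content is not attribute-like; skip this PI
        yield fake.tag, fake.attrib


# Made-up sample: the second PI does not follow the attribute convention.
data = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?foo class="abc" options="bar,baz"?>\n'
    b'<?bar 1 + 1 = 2?>\n'
    b'<document/>'
)

for target, attrib in pi_attributes(io.BytesIO(data)):
    print(target, attrib)
```

With this sample, only the foo PI is yielded, together with its class and options values; the bar PI is dropped because its content does not parse as attributes.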
Nico Schlömer