[Python] I'm trying to retrieve any element in an XML document that has an href attribute, at any level of the XML document. For example:

<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>

This would retrieve the following elements (as lxml element objects, simplified for visual clarity):

[<OuterElement href='a.com'>, <NestedInner href='b.com' />, <NestedInner href='c.com' />, <InnerElement href='d.com'/>]

I've tried using the following code to retrieve any element with an href attribute, but it retrieves zero elements on a file full of elements with href attributes:

from lxml import etree

with open(file, 'rb') as f:
    xml_tree = etree.parse(f)
    href_elements = xml_tree.xpath(".//*[@href]")

Shouldn't this code select any element (.//*) with the specified attribute ([@href])? From my understanding (please correct me if I am wrong, which I most likely am), href_elements should be a list of lxml element objects, each of which has an href attribute.

Important clarification: I have seen many people asking about XPath on Stack Overflow, but I have yet to find a solved question about how to search through all elements in an XML document and retrieve every element that fits a criterion (such as having an href attribute).

martineau

1 Answer

A solution based on the standard library's xml.etree.ElementTree:

import xml.etree.ElementTree as ET

xml = '''<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>'''

root = ET.fromstring(xml)
# findall('.//*...') only searches below root, so check the root itself first.
elements_with_href = [root] if 'href' in root.attrib else []
elements_with_href.extend(root.findall('.//*[@href]'))
for e in elements_with_href:
    print(f'{e.tag} : {e.attrib["href"]}')

Output:

OuterElement : a.com
NestedInner : b.com
NestedInner : c.com
InnerElement : d.com
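
To see why the root needs its own check, here is a minimal sketch on a trimmed-down document (the element names and variable names are illustrative, not taken from the original file). It compares what `findall` returns on its own with the result after the explicit root check:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Outer href='a.com'><Inner href='b.com'/><Inner/></Outer>")

# './/*' walks the descendants of root, so root itself is never a candidate.
descendants_only = [e.tag for e in root.findall(".//*[@href]")]

# Adding the explicit root check restores the missing element.
with_root = ([root.tag] if "href" in root.attrib else []) + descendants_only

print(descendants_only)  # ['Inner']
print(with_root)         # ['Outer', 'Inner']
```
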
balderman
  • This worked, thank you. Can you link the documentation? – Cameron Gould Nov 17 '21 at 19:01
  • For anyone who comes across this: I parse the XML tree from an XML file, so the line `root = ET.fromstring(xml)` gets replaced with `root = etree.parse(filename).getroot()` – Cameron Gould Nov 17 '21 at 19:22
  • @CameronGould, you can apply [`.iter()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter) as a one-liner alternative: `[(e.tag, href) for e in root.iter() if (href := e.get("href"))]` – Olvin Roght Nov 17 '21 at 21:13
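
Putting the two comments above together, the following is a rough sketch of parsing from a file and collecting the elements with `iter()`. Because `iter()` visits the root as well as its descendants, no separate root check is needed. The helper name `elements_with_attribute` and the temporary file are assumptions made purely for illustration; in real use `filename` would point at an existing XML file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = """<OuterElement href='a.com'>
  <InnerElement>
    <NestedInner href='b.com' />
    <NestedInner href='c.com' />
    <NestedInner />
  </InnerElement>
  <InnerElement href='d.com'/>
</OuterElement>"""


def elements_with_attribute(root, name):
    """Hypothetical helper: return root and every descendant carrying `name`."""
    # Element.iter() yields the element itself first, then its descendants
    # in document order.
    return [e for e in root.iter() if name in e.attrib]


# Write the sample to a temporary file to stand in for a real file on disk.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE)
    filename = tmp.name

try:
    root = ET.parse(filename).getroot()
    found = elements_with_attribute(root, "href")
    pairs = [(e.tag, e.get("href")) for e in found]
    print(pairs)
finally:
    os.remove(filename)
```

Testing `name in e.attrib` rather than the truthiness of `e.get(name)` also keeps elements whose attribute is present but empty (for example `href=''`), which the walrus-based one-liner would skip.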