
I'm new to XPath and am working with an XML file which looks like this:

<doc>
    <component>
        <author> Bob </author>
    </component>
    
    <component>
        <sB>
            <component>
                <section ID='S1'>
                    <title>Some s1 title</title>
                </section>
            </component>
            <component>
                <section ID='S2'>
                    <title>Some s2 title</title>
                </section>
            </component>
        </sB>
    </component>
</doc>

I want to retrieve the component item above with section ID = S1, or alternatively the one that has a title element with text 'Some s1 title'. I cannot count on these things being in a particular order.

So far I've tried

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
res = tree.getroot().findall(".//*[title='Some s1 title']../../")
for i in res:
    ET.dump(i)

but that gets me both components, not just the one with the matching title.

I've also tried to search at the section ID level, like so:

res = tree.getroot().findall(".//*section[@ID='S1']/../")
for i in res:
    ET.dump(i)

but that doesn't get me the parent (the whole component) and instead just gets me the section.

Both of these seem like they might work from the simple example syntax I've seen online, but clearly in both cases I'm missing some understanding of what is actually happening. Could someone please clarify what is happening here and why I'm not getting what I would expect?

kjhughes
gammapoint
  • The XPath to get the component above section 'S1' is `//section[@ID='S1']/..` – derloopkat Oct 25 '21 at 19:34
  • Huh, that does work. Thanks Daniel. I didn't realize that last `/` would change the meaning. And that `*` I put in front of section wasn't important either, apparently (but didn't hurt either). Can you elaborate on that versus the one with the `*` I put in front? – gammapoint Oct 25 '21 at 19:38

2 Answers


Craft your XPath expression to select component and then use the predicate (the conditions inside the square brackets) to determine which components you want. Such as:

component containing section with ID = 'S1'

//component[./section[@ID='S1']]

or component containing section/title = 'Some s1 title'

//component[./section/title/text() = 'Some s1 title']

or component containing section with ID = 'S1' and that section has title = 'Some s1 title'

//component[./section[@ID='S1']/title/text() = 'Some s1 title']

and other variations thereof are possible.

David Denenberg
  • You do not need the `./` before `section[..`. The expression `//component[section[@ID='S1']]` is enough. – zx485 Oct 25 '21 at 20:41
  • Also, [ElementTree has very limited support for XPath](https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax). Some/all of your examples may not work; even if they're valid and correct XPaths. – Daniel Haley Oct 25 '21 at 22:40

There are syntax errors with both of your XPaths:

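  1. `.//*[title='Some s1 title']../../` is missing a `/` after the predicate. Even with that fixed, it overshoots upward: two parent steps from the matched section land on sB, not on component.

  2. `.//*section[@ID='S1']/../` cannot have a `*` before `section`. Without the `*` (and without the trailing `/`), this one would work.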
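
For reference, the repaired forms of the two paths can be checked with a minimal sketch like the one below. It assumes the question's XML is inlined as a string (`XML` and the variable names are illustrative, not from the question) and uses only the standard library. ElementTree also appears to treat a trailing `/` as `/*`, which would explain the results described in the question, so the trailing slash is dropped here.

```python
import xml.etree.ElementTree as ET

# The question's document, inlined so the sketch is self-contained.
XML = """
<doc>
    <component>
        <author> Bob </author>
    </component>
    <component>
        <sB>
            <component>
                <section ID='S1'>
                    <title>Some s1 title</title>
                </section>
            </component>
            <component>
                <section ID='S2'>
                    <title>Some s2 title</title>
                </section>
            </component>
        </sB>
    </component>
</doc>
"""

root = ET.fromstring(XML)

# Path 1 repaired: add the missing "/" and step up only once
# (section -> component) instead of twice.
by_title = root.findall(".//*[title='Some s1 title']/..")

# Path 2 repaired: drop the stray "*" and the trailing "/".
by_id = root.findall(".//section[@ID='S1']/..")

for comp in by_title + by_id:
    ET.dump(comp)
```

Both calls should select the same single component, the one wrapping section S1.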

But rather than repairing these and working upward, you don't really need the parent or ancestor axis at all; it's better to use a predicate higher in the hierarchy:


This XPath,

//component[section/@ID='S1']

selects the component elements that have a section child whose ID attribute value equals 'S1'.


This XPath,

//component[section/title='Some s1 title']

selects the component elements that have a section child with a title child whose string value equals 'Some s1 title'.


Notes on Python XPath library quirks:

  • ElementTree: Noncompliant. Avoid.
  • lxml: Use xpath() rather than findall().
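
If switching libraries is not an option, one possible workaround is to let ElementTree do only the simple selection and apply the condition in plain Python. The sketch below is an illustration under assumptions, not a recommended API: `XML` is the question's document inlined as a string, and `components_with_section_id` is a hypothetical helper name. It first tries the standard expression, which the thread reports ElementTree rejecting with `invalid predicate`, then falls back to filtering by hand.

```python
import xml.etree.ElementTree as ET

# The question's document, inlined so the sketch is self-contained.
XML = """
<doc>
    <component><author> Bob </author></component>
    <component>
        <sB>
            <component>
                <section ID='S1'><title>Some s1 title</title></section>
            </component>
            <component>
                <section ID='S2'><title>Some s2 title</title></section>
            </component>
        </sB>
    </component>
</doc>
"""

root = ET.fromstring(XML)

# As reported in the thread, ElementTree's XPath subset does not accept
# a path inside a predicate and raises SyntaxError for this expression.
try:
    root.findall(".//component[section/@ID='S1']")
    rejected = False
except SyntaxError:
    rejected = True
print("standard expression rejected:", rejected)


def components_with_section_id(root, section_id):
    """Hypothetical helper: components with a section child having this ID."""
    return [
        comp
        for comp in root.iter("component")
        if any(sec.get("ID") == section_id for sec in comp.findall("section"))
    ]


matches = components_with_section_id(root, "S1")
for comp in matches:
    ET.dump(comp)
```

With lxml, the thread reports that passing the original expression to `xpath()` works directly, so a helper like this is only worth having when you are limited to the standard library.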


kjhughes
  • Thanks, although unless I'm doing something stupid this doesn't work. When I try `res = tree.getroot().findall("//component[section/@ID='S1']")` followed by `for i in res: ET.dump(i)`, I get `SyntaxError: cannot use absolute path on element`. Did I do something incorrect here? – gammapoint Oct 27 '21 at 17:02
  • The XPaths are correct, but if you want to use them in your Python / ElementTree code as is, try making them relative (`.//`) rather than absolute (`//`). – kjhughes Oct 27 '21 at 17:41
  • Yeah, I tried that too, using `res = tree.getroot().findall(".//component[section/@ID='S1']")` but that gives me `SyntaxError: invalid predicate`. I've tested your XPaths with browser-based tools and they work fine, but I just can't figure out why this isn't working in Python. – gammapoint Oct 27 '21 at 17:48
  • Damn ElementTree. As @DanielHaley mentions in another comment, [ElementTree is not fully compliant with the XPath standard](https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax). I advise you to switch to a better library [such as libxml](https://stackoverflow.com/q/8692/290085). I'm certainly not going to waste time dancing around faulty library support for a standard, and I suggest you do not either. – kjhughes Oct 27 '21 at 17:59
  • I tried lxml already and it's telling me the same thing, unfortunately, which makes me wonder if it's really the library. Though as I said, your Xpath does seem to work in browser testers so I dunno... – gammapoint Oct 27 '21 at 18:02
  • The XPaths are correct to the standard. You're not the first to suffer these annoyances, but you should have better luck with lxml than with ElementTree. [This Q/A](https://stackoverflow.com/a/65030511/290085) is similar and should help. – kjhughes Oct 27 '21 at 18:08
  • Okay, I think I figured it out. Apparently lxml's findall implementation is just as bad as ElementTree's (https://stackoverflow.com/a/36674153/1961582). If I instead search with lxml using the .xpath method it seems to work. – gammapoint Oct 27 '21 at 18:24
  • @gammapoint: Glad you got it working. Updated answer to include mention of Python library quirks for benefit of future readers. – kjhughes Oct 27 '21 at 18:39