Let's say I have an XML file like this one:

<books>
  <book>
    <title>John is alive</title>
    <abstract>
        A man is found alive after having disappeared for 10 years.
    </abstract>
    <description>
        <en> John disappeared 10 years ago. Lorem ipsum dolor sit amet ...</en>
        <fr> Il y a 10 ans, John disparaissait. Lorem ipsum dolor sit amet ...</fr>
    </description>
    <notes>First book in the series, where the character is introduced</notes>
  </book>
  <book>
    <title>The disappearance of John</title>
    <abstract>
        A prequel to the book "John is alive".
    </abstract>
    <description>
        <en> He led an ordinary life, but then ... lorem ipsum dolor sit amet ...</en>
        <fr> Sa vie était tout à fait ordinaire, mais ... lorem ipsum dolor sit amet ...</fr>
    </description>
    <notes>Second book in the "John" series, but first in chronological order</notes>
  </book>
</books>

My question is simple: how can I, using XPath, get a collection of all nodes that contain the word "John"?

Obviously, I can specify a series of nodes and that works fine:

(//title | //abstract | //description/* | //notes)[contains(lower-case(text()),"john")]

But if my XML grows (and it will!), with new elements being added at various levels in the structure, I don't want to constantly have to go back and adjust my XPath.

What I fail to understand is why a generic statement like

//*[contains(lower-case(text()),"john")]

fails with this error message: "Required cardinality of first argument of lower-case() is one or zero".

Yet, not all statements with an asterisk fail.

For instance:

//books/book/*[contains(lower-case(text()),"john")] fails with the above error message

while

//books/book/*/*[contains(lower-case(text()),"john")] succeeds and retrieves both the <en> and <fr> nodes from the first <description> element

If it's not possible, fine, I will list all elements in my XPath, but I would still like a clear understanding of how the * selector behaves in the context of a contains() operation.

Filipus
  • What's your XPath version and environment? Works for me in XPath 2.0: https://cutt.ly/UjLvEtI – wp78de Jan 22 '21 at 18:47
  • Am using XPath 2.0, as evidenced by the lower-case() statements. But anyway, @kjhughes gave me the answer I was looking for. – Filipus Jan 23 '21 at 20:56

1 Answer

The terms "nodes" (see XPath difference between child::* and child::node()) and "contains" (see How to use XPath contains() for specific text?) are ambiguous unless used precisely, but one of the following XPaths will likely meet your needs:

  1. All nodes whose string value contains the substring, "John":

    //node()[contains(.,"John")]
    
  2. All such elements:

    //*[contains(.,"John")]
    
  3. All such attributes:

    //@*[contains(.,"John")]
    
  4. All such text nodes:

    //text()[contains(.,"John")]
    
  5. All elements with text node children that contain the substring, "John":

    //*[text()[contains(.,"John")]]
    

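Notice that #1 will include the books element, because its string value (the concatenation of all its descendant text) contains "John", but #5 will exclude it, because none of its own text node children do. See Testing text() nodes vs string values in XPath.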
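To make the difference between #2 and #5 concrete, here is a minimal Python sketch that imitates the two tests with the standard library's xml.etree.ElementTree. It is not an XPath engine: string_value and text_children are hypothetical helpers written for this illustration, and the XML is a shortened version of the sample in the question.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
XML = """<books>
  <book>
    <title>John is alive</title>
    <description>
        <en> John disappeared 10 years ago.</en>
        <fr> Il y a 10 ans, John disparaissait.</fr>
    </description>
  </book>
</books>"""


def string_value(elem):
    """Approximate the XPath string value: all descendant text, concatenated."""
    return "".join(elem.itertext())


def text_children(elem):
    """Approximate text(): the text before the first child plus each child's tail."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]


root = ET.fromstring(XML)
elements = list(root.iter())  # every element, in document order

# Roughly //*[contains(.,"John")]
by_string_value = [e.tag for e in elements if "John" in string_value(e)]

# Roughly //*[text()[contains(.,"John")]]
by_text_child = [e.tag for e in elements
                 if any("John" in t for t in text_children(e))]

print(by_string_value)  # ['books', 'book', 'title', 'description', 'en', 'fr']
print(by_text_child)    # ['title', 'en', 'fr']
```

The first list includes every ancestor of a matching text node, while the second keeps only the elements that directly hold the matching text.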

You can replace contains(.,"John") with contains(lower-case(.),"john") in any of the above XPaths if you're using XPath 2.0. See also Case insensitive XPath contains() possible?
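This also explains the error in the question. In XPath 2.0, lower-case() accepts at most one item, while text() selects every text node child of the context element. An element that has child elements and is indented has several whitespace-only text nodes, so lower-case(text()) fails on it, whereas lower-case(.) always receives a single node. The sketch below counts text node children per element for a shortened version of the sample; it uses Python's xml.etree.ElementTree as a stand-in for an XPath 2.0 processor, and text_children is a hypothetical helper, not part of any XPath API.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
XML = """<books>
  <book>
    <title>John is alive</title>
    <description>
        <en> John disappeared 10 years ago.</en>
        <fr> Il y a 10 ans, John disparaissait.</fr>
    </description>
    <notes>First book in the series</notes>
  </book>
</books>"""


def text_children(elem):
    """Approximate text(): the text before the first child plus each child's tail."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]


root = ET.fromstring(XML)

# How many items text() would hand to lower-case() for each element.
# Any count above 1 triggers the cardinality error in XPath 2.0.
text_node_counts = {e.tag: len(text_children(e)) for e in root.iter()}

# Roughly //*[text()[contains(lower-case(.),"john")]]:
# each text node is lower-cased on its own, so cardinality is never an issue.
matches = [e.tag for e in root.iter()
           if any("john" in t.lower() for t in text_children(e))]

print(text_node_counts)
print(matches)  # ['title', 'en', 'fr']
```

Here description has three text node children (the whitespace around en and fr), which is why //books/book/* fails, while en and fr have exactly one each, which is why //books/book/*/* succeeds.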

kjhughes
  • Number 4 and/or Number 5 are exactly what I was looking for! I didn't know about the //text() construct, that's going to be very valuable in the future. Thanks! – Filipus Jan 23 '21 at 20:52