Why does XPath contains() select an unexpected node?

Question

I'm trying to find an XPath expression that returns only the URLs from all of my documents, whatever the tag is. Here is a sample document:

<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>

The following expression gives me these results:

//*[contains(.,'http')]//text()

https://url
2019-08-07T15:01:51+00:00

What I'm looking for is to get rid of the second line. I need to be able to get only URLs from any XML file.

Michael Kay · Answer 1 · 2022-01-04T17:34:28.357

Well, let's ignore the fact that not all URLs contain "http" and not everything that contains "http" is a URL...

To find all text nodes containing "http", just use //text()[contains(., 'http')].

edited Jan 04 '22 at 17:34

answered Jan 04 '22 at 16:34


Would it be more reasonable if I used www? – Hugo Jan 04 '22 at 16:51
I can't tell what's reasonable for you - you know your data, I don't. Detecting relative URIs like 'index.html' is difficult! Usually it's better to take advantage of the XML tagging rather than rely on matching the content. – Michael Kay Jan 04 '22 at 16:54
I agree with you. I just don't know in advance which tags I will find. But thank you, it helped me. – Hugo Jan 04 '22 at 16:56

kjhughes · Accepted Answer · 2022-01-04T18:45:27.420

The reason that your XPath,

//*[contains(.,'http')]//text()

selects a surprise second result is that this XPath says to select all elements whose string-value contains an "http" substring, and return all descendant text nodes. These elements include not just the immediate parent element of the targeted text node but its ancestors as well:

The loc element, as you expected.
The urlset and url elements too, which you did not expect. (Their string-values also contain "http", and their descendant text nodes include the 2019-08-07T15:01:51+00:00 text node under lastmod.)
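
To check which elements the predicate matches, the string-value test can be emulated by hand. This is only a sketch: it uses Python's standard-library ElementTree, whose limited XPath subset does not support contains(), so an element's string-value is approximated by joining its itertext(). The local_name helper and the variable names are illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Sample document from the question
XML = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>"""


def local_name(el):
    # ElementTree stores namespaced tags as "{namespace-uri}local"
    return el.tag.rsplit("}", 1)[-1]


root = ET.fromstring(XML)

# Emulates //*[contains(., 'http')]: keep every element whose
# string-value (all descendant text, concatenated) contains "http".
matching = [
    local_name(el)
    for el in root.iter()
    if "http" in "".join(el.itertext())
]
print(matching)
```

The ancestors urlset and url pass the test along with loc, so the trailing //text() step also collects the lastmod text beneath them.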

Alternatives to achieve desired result

Narrow the * all-elements wildcard to a single, named element:
```
//loc[contains(.,'http')]/text()
```
Narrow the * all-elements wildcard to multiple, named elements:
```
//*[(self::loc or self::e2) and contains(.,'http')]/text()
```
Select all text nodes containing the substring "http", as noted by Michael Kay:
```
//text()[contains(., 'http')]
```
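
As a rough illustration of the text-node approach, the sketch below applies the same filter to each text node individually instead of to element string-values. It again relies on Python's standard-library ElementTree and assumes that each chunk yielded by itertext() corresponds to one XPath text node, which holds for this sample but not necessarily for documents with comments or processing instructions. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample document from the question
XML = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>"""

root = ET.fromstring(XML)

# Emulates //text()[contains(., 'http')]: test each text node on its own,
# so an ancestor's concatenated string-value never comes into play.
urls = [text.strip() for text in root.itertext() if "http" in text]
print(urls)
```

One caveat, assuming an XPath 1.0 processor: the sample document declares a default namespace, so the name-based alternatives only match if loc is bound to that namespace or tested via local-name(). The text-node approach names no elements and therefore sidesteps the issue.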
