1

I'm trying to find the correct XPath expression to extract only the URLs from all my documents, whatever the tag is. I'm testing with this document:

<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>

The following expression gives me these results:

//*[contains(.,'http')]//text()
  1. https://url
  2. 2019-08-07T15:01:51+00:00

I want to get rid of the second result. I need to be able to get only the URLs from any XML file.

Hugo

2 Answers

1

Well, let's ignore the fact that not all URLs contain "http" and not everything that contains "http" is a URL...

To find all text nodes containing "http", just use //text()[contains(., 'http')].
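
As a rough illustration of what that expression does, here is a minimal Python sketch. It does not run the XPath itself: the standard library's ElementTree supports only a small XPath subset, so the sketch emulates the expression by testing each text node on its own. The names SAMPLE and text_nodes_containing are made up for this example, and the sample is the document from the question.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (SAMPLE is a name chosen for this sketch).
SAMPLE = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>"""


def text_nodes_containing(xml_source, needle="http"):
    """Emulate //text()[contains(., needle)]: each text node is tested alone."""
    root = ET.fromstring(xml_source)
    # itertext() yields every piece of text (element text and tails) in document order.
    # strip() only trims the surrounding whitespace of the matches for display.
    return [t.strip() for t in root.itertext() if needle in t]


print(text_nodes_containing(SAMPLE))  # ['https://url']
```

Because the test is applied to each text node rather than to an element's whole string-value, the lastmod text never gets pulled in.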

Michael Kay
  • Would it be more reasonable to match on "www" instead? – Hugo Jan 04 '22 at 16:51
  • I can't tell what's reasonable for you - you know your data, I don't. Detecting relative URIs like 'index.html' is difficult! Usually it's better to take advantage of the XML tagging rather than rely on matching the content. – Michael Kay Jan 04 '22 at 16:54
  • I agree with you. I just don't know in advance which tags I will find. But thank you, it helped me. – Hugo Jan 04 '22 at 16:56
1

The reason that your XPath,

//*[contains(.,'http')]//text()

selects a surprise second result is that this XPath says to select all elements whose string-value contains an "http" substring, and return all descendant text nodes. These elements include not just the immediate parent element of the targeted text node but its ancestors as well:

  1. The loc element, as you expected.
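  2. The urlset and url elements too, which you did not expect. (Their string-values also contain "http", because the string-value of an element is the concatenation of all its descendant text nodes. Those descendants include the 2019-08-07T15:01:51+00:00 text node, so it is returned as well.)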
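
The matching described above can be traced with a small Python sketch. It emulates the expression with the standard library rather than evaluating the XPath, and it drops whitespace-only text nodes on the assumption that the asker's tool did the same (only two results were reported). The helper names text_nodes and emulate_original are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (SAMPLE is a name chosen for this sketch).
SAMPLE = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>"""


def text_nodes(el):
    """Yield (owner, slot, value) for each descendant text node of el, in document order."""
    if el.text:
        yield el, "text", el.text
    for child in el:
        yield from text_nodes(child)
        if child.tail:
            yield child, "tail", child.tail


def emulate_original(xml_source, needle="http"):
    """Emulate //*[contains(., needle)]//text() and report which elements matched."""
    root = ET.fromstring(xml_source)
    matched, selected, seen = [], [], set()
    for el in root.iter():
        # String-value of an element: all of its descendant text concatenated.
        if needle in "".join(el.itertext()):
            matched.append(el.tag.rpartition("}")[2])  # drop the {namespace} part
            for owner, slot, value in text_nodes(el):
                key = (id(owner), slot)
                # Skip nodes already selected and (by assumption) whitespace-only nodes.
                if key not in seen and value.strip():
                    seen.add(key)
                    selected.append(value.strip())
    return matched, selected


matched, selected = emulate_original(SAMPLE)
print(matched)   # ['urlset', 'url', 'loc']
print(selected)  # ['https://url', '2019-08-07T15:01:51+00:00']
```

The lastmod element itself never matches; its text is selected only because it is a descendant of urlset and url, which do match.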

Alternatives to achieve desired result

  • Narrow the * all-elements wildcard to a single, named element:

    //loc[contains(.,'http')]/text()
    
  • Narrow the * all-elements wildcard to multiple, named elements:

    //*[(self::loc or self::e2) and contains(.,'http')]/text()
    
  • Select all text nodes containing the substring "http", as noted by Michael Kay:

    //text()[contains(., 'http')]
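
One caveat when trying the named-element alternatives against the sample in the question: that document declares a default namespace, and in XPath 1.0 an unprefixed name test such as loc matches only elements in no namespace. The sketch below shows the same effect with Python's standard-library ElementTree, where the element name has to be namespace-qualified before it matches. The prefix sm is an arbitrary choice for this example, and other XPath processors bind prefixes in their own way.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (SAMPLE is a name chosen for this sketch).
SAMPLE = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51+00:00
    </lastmod>
  </url>
</urlset>"""

# "sm" is a hypothetical prefix; only the namespace URI has to match the document.
NS = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SAMPLE)

# An unprefixed name looks for loc in no namespace, so nothing is found.
unqualified = root.findall(".//loc")

# A namespace-qualified name finds the element.
qualified = root.findall(".//sm:loc", NS)

# Apply the "contains http" test to the text of the named elements.
urls = [el.text.strip() for el in qualified if el.text and "http" in el.text]

print(len(unqualified))  # 0
print(urls)              # ['https://url']
```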
    

See also

kjhughes
  • Thank you. I had already looked at these topics but didn't find the syntax I was looking for (which is the last one). The problem I'm facing is that I don't know which tags I will find in my XML document. That's why I need to search the text, whatever the tag is. And I'm not sure your first and second alternatives work if the named tag isn't in my XML (and I'm curious to know why, if I'm wrong). – Hugo Jan 05 '22 at 00:12
  • The first and second alternatives do presume the existence of the named elements. Keep in mind that well-designed markup obviates the need to pattern match on text as the whole point of markup is to identify document parts. – kjhughes Jan 05 '22 at 01:16