
CONTEXT

Suppose the following HTML:

....
<p>Whatever</p>
<div>Whatever DIV78232 Everwhat</div>
....

Question:

How can I return a plain-text string containing DIVnnnnn, where nnnnn represents any sequence of digits?

My investigation so far:

The XPath replace() function replaces a pattern found in the string value of the current node:

replace(.,'.*?(DIV\d+).*','$1') => DIV78232

Why am I blocked?

Because I want the query to return "DIV78232" as a string, without actually replacing anything in the DOM, just as the query /p/text() would return "Whatever". (I am trying all this in the FirePath Firefox extension.)

Note: According to the official docs:

"replace() Returns the value of the first argument with every substring matched by the regular expression that is the value of the second argument replaced by the replacement string that is the value of the third argument."
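As the quoted definition suggests, a regex replace returns a new string value rather than changing its input. The sketch below illustrates that idea in Python, not XPath: Python's re.sub is only an analogue of XPath's replace(), and node_text is a hypothetical stand-in for the string value of the div.

```python
import re

# Hypothetical stand-in for the string value of the <div> node.
node_text = "Whatever DIV78232 Everwhat"

# re.sub builds and returns a new string; node_text itself is not modified.
# The pattern consumes the whole string and keeps only capturing group 1.
extracted = re.sub(r".*?(DIV\d+).*", r"\1", node_text)

print(extracted)  # DIV78232
print(node_text)  # Whatever DIV78232 Everwhat
```

The source text is left intact, which is the behaviour the question is asking for.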

FINAL PURPOSE:

My final purpose is to get the image URL (as a string) that matches '.*?image:.*?"(.+?)".*' in the script shown below, which is inside the HTML.

In this case, the query //*[matches(.,'.*?image:.*?"(.+?)".*','i')] returns the whole node, but I only want the first capturing group, which is the image URL.

<script>...vp&output=xml_vast2&unviewed_position_start=1&
url='+encodeURIComponent(location.href)+'
description_url='+encodeURIComponent(location.href)+'&
image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/56797be1c46188ac438b45c3.jpg", // stretching: 'fi..</script>
Álvaro N. Franz
  • What do you mean with plain text node? AFAICS it's an attribute value you are looking for? – PeeHaa Dec 22 '15 at 21:07
  • @PeeHaa Thanks for your time. The question has an example that I reduced, but the final purpose is a bit more than that. I just updated the question with the specific purpose. – Álvaro N. Franz Dec 22 '15 at 21:13

1 Answer


It took me a long while, but this is the result I got by combining replace() and tokenize():

tokenize(replace(.,'.*?image:.*?"(.+?)".*?',':@:$1:@:'),':@:')[2]

This returns the image URL from the snippet above.

Why/How does this work?

  • replace() matches the image entry and wraps the capturing group in my own token separator, ':@:' (it could be any string unlikely to appear in the document).
  • tokenize() splits the resulting string into three parts, the second of which is the capturing group I was looking for. (There will be three parts because it is highly improbable that the document contains ':@:' anywhere else.)
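The two steps above can be sketched outside XPath as well. The following is a minimal Python analogue, not the XPath expression itself: re.sub stands in for replace(), str.split stands in for tokenize(), and script_text, SEPARATOR and extract_image_url are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical, shortened stand-in for the text content of the <script> node.
script_text = (
    "description_url='+encodeURIComponent(location.href)+'&\n"
    'image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/'
    '56797be1c46188ac438b45c3.jpg", // stretching\n'
)

# Separator assumed not to occur anywhere else in the text.
SEPARATOR = ":@:"


def extract_image_url(text: str) -> str:
    # Step 1 (analogue of replace()): wrap capturing group 1 in the separator.
    wrapped = re.sub(
        r'.*?image:.*?"(.+?)".*?',
        SEPARATOR + r"\1" + SEPARATOR,
        text,
    )
    # Step 2 (analogue of tokenize()): split on the separator.
    parts = wrapped.split(SEPARATOR)
    # XPath positions are 1-based, so [2] there corresponds to index 1 here.
    # If the pattern did not match, there is no second part and this raises.
    return parts[1]


print(extract_image_url(script_text))
```

As in the XPath version, the approach relies on the separator being absent from the rest of the text; otherwise the split would yield more than three parts.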

Is there any faster way to achieve this?

Thanks. All the best. Peace.

Álvaro N. Franz