//*/text()[string-length() > 100]

...almost works, except that it also selects the text inside script and style elements in the HTML document, and the selection stops wherever it encounters a <br> or other tag.


I want to find elements that contain text directly, and the text is greater than 140 chars and text for that entire element should be selected (sometimes the text is further inside span).

kjhughes
eozzy

1 Answer


You need to understand the difference between text() nodes and string values in XPath.

  • text() selects text nodes in XPath. The br elements shown in your selection form mixed content in the parent element: text() nodes and elements mixed together.
  • string() is an XPath function that returns the string value of an XPath expression. To get a string that ignores the br elements, select the parent div and either directly take its string value via string() or implicitly get its string value by using the expression in a context where a conversion to string is implied.
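
The same distinction can be sketched outside XPath. The following is a minimal illustration in Python, assuming only the standard library's xml.etree.ElementTree; the helper names direct_text_nodes and string_value are hypothetical and chosen here for illustration. ElementTree has no text-node objects, so the direct text children are read from the element's text and from its children's tail attributes.

```python
import xml.etree.ElementTree as ET

def direct_text_nodes(elem):
    """Text chunks that are direct children of elem (roughly what text() selects)."""
    chunks = []
    if elem.text:
        chunks.append(elem.text)
    for child in elem:
        # Text that follows a child element still belongs to the parent.
        if child.tail:
            chunks.append(child.tail)
    return chunks

def string_value(elem):
    """Concatenation of all descendant text (what string() returns for an element)."""
    return "".join(elem.itertext())

a = ET.fromstring("<a>This is a <b>test</b> of mixed content.</a>")
print(direct_text_nodes(a))  # ['This is a ', ' of mixed content.']
print(string_value(a))       # This is a test of mixed content.
```

The two direct text nodes are split by the b element, just as a br splits them in HTML, while the string value reads straight through the markup.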

With that background, your statement,

I want to find elements that contain text directly, and the text is greater than 140 chars and text for that entire element should be selected (sometimes the text is further inside span).

can be rephrased as

I want to find elements with text() node children and whose string value has a length greater than 140.

Let's look at some sample XML,

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

and let's reduce the 140 to 7 to make it more manageable, then

//*[text()][string-length() > 7]

captures the rephrased requirement and selects four elements:

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

<a>This is a <b>test</b> of mixed content.</a>

<c>asdf asdf asdf asdf</c>

<d>asdf asdf</d>

Notice that b is not selected: its string value, test, is only 4 characters long, so it fails the string-length() > 7 predicate.

Notice also that r is selected due to whitespace-only text() between the elements. To eliminate such elements, add an additional predicate to text():

//*[text()[normalize-space()]][string-length() > 7]

Then, only a, c, and d will be selected.
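
To check the two predicates by hand, here is a rough emulation in Python. It is a sketch under stated assumptions, not an XPath engine: the standard library's ElementTree supports only a small XPath subset without functions such as string-length(), so each predicate is written out in plain Python. The function names has_nonblank_text_child and select are hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

XML_WS = " \t\r\n"  # the characters XPath treats as whitespace

def has_nonblank_text_child(elem):
    """Emulates [text()[normalize-space()]]: some direct text node is not blank."""
    chunks = [elem.text] + [child.tail for child in elem]
    return any(chunk and chunk.strip(XML_WS) for chunk in chunks)

def select(root, min_len):
    """Emulates //*[text()[normalize-space()]][string-length() > min_len]."""
    return [
        elem for elem in root.iter()  # root and all descendants, document order
        if has_nonblank_text_child(elem)
        and len("".join(elem.itertext())) > min_len
    ]

root = ET.fromstring(XML)
print([elem.tag for elem in select(root, 7)])  # ['a', 'c', 'd']
```

Here r drops out because its direct text is whitespace only, and b drops out because its string value is too short.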

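If you want text only, in XPath 1.0 you can take the string value of the selection. Note that string() applied to a node-set returns the string value of only the first node in document order (here, a):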

string(//*[text()[normalize-space()]][string-length() > 7])

If you want a collection of strings, in XPath 1.0, you'll need to iterate over the elements via the language calling XPath, but in XPath 2.0, you can add a string() step at the end:

//*[text()[normalize-space()]][string-length() > 7]/string()

to get a sequence of three separate strings:

This is a test of mixed content.
asdf asdf asdf asdf
asdf asdf
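
For the XPath 1.0 case, where the iteration happens in the calling language, the loop might look like the sketch below. It assumes Python's standard-library ElementTree and emulates the predicates in plain Python, since that module cannot evaluate string-length(); the helper name matches is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""
XML_WS = " \t\r\n"  # the characters XPath treats as whitespace

def matches(elem, min_len):
    """Non-blank direct text child and string value longer than min_len."""
    direct = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip(XML_WS) for t in direct)
    return has_text and len("".join(elem.itertext())) > min_len

root = ET.fromstring(XML)
selected = [e for e in root.iter() if matches(e, 7)]

# One string per selected element, like the XPath 2.0 /string() step.
strings = ["".join(e.itertext()) for e in selected]

# What XPath 1.0 string() gives for a node-set: the first node only.
first_only = strings[0] if strings else ""

for s in strings:
    print(s)
```

The loop prints the same three strings listed above, while first_only holds just the string value of a.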
kjhughes