You need to understand difference between text() nodes and string values in XPath.
text()
selects text nodes in XPath. The br
elements shown in
your selection form mixed content in the parent element: text()
nodes and elements mixed together.
string()
is an XPath function that returns the string value of an XPath expression. To get a string that ignores the br
elements, select the
parent div
and either directly take its string value via string()
or implicitly get its string value by using the expression in a
context where a conversion to string is implied.
With that background, your statement,
I want to find elements that contain text directly, and the text is
greater than 140 chars and text for that entire element should be
selected (sometimes the text is further inside span).
can be rephrased as
I want to find elements with text()
node children and whose string value has a length greater than 140.
Let's look at some sample XML,
<r>
<a>This is a <b>test</b> of mixed content.</a>
<c>asdf asdf asdf asdf</c>
<d>asdf asdf</d>
</r>
and let's reduce the 140 to 8 to make it more manageable, then
//*[text()][string-length() > 7]
captures the rephrased requirement and selects four elements:
<r>
<a>This is a <b>test</b> of mixed content.</a>
<c>asdf asdf asdf asdf</c>
<d>asdf asdf</d>
</r>
<a>This is a <b>test</b> of mixed content.</a>
<c>asdf asdf asdf asdf</c>
<d>asdf asdf</d>
Notice that it did not select b
because its string value's length is less than 7 characters.
Notice also that r
is selected due to whitespace-only text()
between the elements. To eliminate such elements, add an additional predicate to text()
:
//*[text()[normalize-space()]][string-length() > 7]
Then, only a
, c
, and d
will be selected.
If you want text only, in XPath 1.0 you can collectively take the string value:
string(//*[text()[normalize-space()]][string-length() > 7])
If you want a collection of strings, in XPath 1.0, you'll need to iterate over the elements via the language calling XPath, but in XPath 2.0, you can add a string()
step at the end:
//*[text()[normalize-space()]][string-length() > 7]/string()
to get a sequence of three separate strings:
This is a test of mixed content.
asdf asdf asdf asdf
asdf asdf
or other tag* - should tags also be captured within a text ? – RomanPerekhrest Dec 10 '16 at 12:35