XPath - how to select text

Question

How do I select the text `The quick brown fox.` in the following document?

<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick brown fox.
</a>

`/a/text()[2]` but this solution is not universal, works for this case only. — khachik, Dec 07 '10 at 07:28
@khachik, I believe this is incorrect. There is only one child text node of `a`. — Yodan Tauber, Dec 07 '10 at 07:58
@Yodan I'm sorry to ruin your belief about it, but `a` has two child text nodes: one is the whitespace between `<a>` and `<b>`, and the second is the text that needs to be extracted. — khachik, Dec 07 '10 at 09:35
@khachik: Well, at least according to the .NET implementation (of `XmlNode.SelectNodes`), there is only one node that matches `/a/text()`. I think it ignores whitespace when there is *nothing but whitespace*, but includes whitespace when other characters are present. — Yodan Tauber, Dec 07 '10 at 12:51
@Yodan DOM, SAX, XPath, XML have their own specifications which don't depend on .NET or other implementations. — khachik, Dec 07 '10 at 12:55
@khachik and @Yodan: Whether whitespace-only text nodes are preserved or stripped from the tree depends on the XML tree provider of the host language. Microsoft products strip them by default. — , Dec 07 '10 at 13:13
Naturally. I stand corrected (and I admit I have never read these specifications; my answers were based on .NET experience). One should apparently be careful when using the .NET implementation due to its quirks (somehow this starts to look like IE6). — Yodan Tauber, Dec 07 '10 at 13:27

score 4 · Accepted Answer · answered Dec 07 '10 at 13:18

As discussed in the comments, when dealing with mixed content it is important to know whether whitespace-only text nodes are being preserved or stripped.
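To see the difference for yourself, here is a minimal sketch in Python. It assumes `xml.dom.minidom` as the tree provider, which keeps whitespace-only text nodes. The standard library has no full XPath engine, so the hypothetical helper `child_text_nodes` walks the DOM by hand to mimic what `/a/text()` selects.

```python
from xml.dom import Node, minidom

XML = """<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick brown fox.
</a>"""


def child_text_nodes(element):
    """Return the text-node children of element (what the text() step selects)."""
    return [n for n in element.childNodes if n.nodeType == Node.TEXT_NODE]


root = minidom.parseString(XML).documentElement
texts = [n.data for n in child_text_nodes(root)]

# minidom preserves whitespace-only text nodes, so <a> has two text children:
# the whitespace between <a> and <b>, and the text after </b>.
for position, value in enumerate(texts, start=1):
    print(position, repr(value))
```

With a provider that strips whitespace-only nodes, only the second of these would be present, which is why `/a/text()[2]` is not portable.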

Universal solution:

/a/text()[normalize-space()][1]

Meaning: the first text node child of the root element that is not whitespace-only

Another possibility:

/a/text()[last()]

Meaning: the last text node child of the root element
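A rough Python sketch of what the two expressions pick out, assuming the question's document and a whitespace-preserving parser (`xml.dom.minidom`). This emulates the predicates by hand and is not an XPath evaluator; the names `XML_WS`, `first_non_blank` and `last_text` are made up for the illustration.

```python
from xml.dom import Node, minidom

XML = """<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick brown fox.
</a>"""

# The characters XPath's normalize-space() treats as whitespace.
XML_WS = " \t\r\n"

root = minidom.parseString(XML).documentElement
texts = [n.data for n in root.childNodes if n.nodeType == Node.TEXT_NODE]

# /a/text()[normalize-space()][1]: drop text nodes that are empty after
# whitespace normalization, then take the first survivor.
first_non_blank = [t for t in texts if t.strip(XML_WS)][0]

# /a/text()[last()]: take the last text node child, whatever it contains.
last_text = texts[-1]

print(repr(first_non_blank))
print(repr(last_text))
```

For this document both select the same node. The first form does not depend on whether the whitespace-only node before `<b>` was kept, and the selected value still carries its surrounding whitespace.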

I'd make that predicate `[normalize-space(.) != '']` to make it more explicit. – Robert Rossney Dec 07 '10 at 18:58

score 1 · Answer 2 · answered Dec 07 '10 at 07:55

`text()` selects all child text nodes of the current node, so `/a/text()` is the way to go. Just remember that you may need to do some string manipulation on the results, because an XML document like this one:

<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick <!--comment--> brown fox.
</a>

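will return two text nodes with content ("The quick " and " brown fox."), because the comment splits the text; a parser that preserves whitespace-only nodes also returns the one before `<b>`. The text values will also contain whitespace (e.g. the newline after `</b>` and before "The").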
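A small sketch of that string manipulation, assuming Python's `xml.dom.minidom` as the parser. It keeps both comments and whitespace-only text nodes. The variable names are illustrative only.

```python
from xml.dom import Node, minidom

XML = """<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick <!--comment--> brown fox.
</a>"""

root = minidom.parseString(XML).documentElement

# Text-node children of <a>; the comment node sits between the last two.
texts = [n.data for n in root.childNodes if n.nodeType == Node.TEXT_NODE]

# Concatenate the pieces, then collapse runs of whitespace to single spaces.
joined = " ".join("".join(texts).split())

print(len(texts))
print(joined)
```

Concatenating before collapsing whitespace matters here: the space on either side of the comment would otherwise be lost or doubled, depending on how the pieces are trimmed.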

score 0 · Answer 3 · answered Dec 07 '10 at 07:47

You can start with `/a/text()`. This returns only the text nodes, not the tags.

answered Dec 07 '10 at 07:47 by mariana soffer
