2

How do I get the text "The quick brown fox." from the following document?

<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick brown fox.
</a>
StackOverflowNewbie
  • `/a/text()[2]`, but this solution is not universal; it works for this case only. – khachik Dec 07 '10 at 07:28
  • @khachik, I believe this is incorrect. There is only one child text node of `a`. – Yodan Tauber Dec 07 '10 at 07:58
  • @Yodan I'm sorry to ruin your belief about it, but `a` has two child text nodes: one is the whitespace between `<a>` and `<b>`, and the second is the text that needs to be extracted. – khachik Dec 07 '10 at 09:35
  • @khachik: Well, at least according to the .NET implementation (of `XmlNode.SelectNodes`), there is only one node that matches `/a/text()`. I think it ignores whitespace when there is *nothing but whitespace*, but includes whitespace when other characters are present. – Yodan Tauber Dec 07 '10 at 12:51
  • @Yodan DOM, SAX, XPath, and XML have their own specifications, which don't depend on .NET or any other implementation. – khachik Dec 07 '10 at 12:55
  • @khachik and @Yodan: Whether white-space-only text nodes are preserved or stripped from the tree depends on the XML tree provider of the host language. Microsoft products strip them by default. –  Dec 07 '10 at 13:13
  • Naturally. I stand corrected (and I admit I have never read these specifications; my answers were based on .NET experience). One should apparently be careful when using the .NET implementation due to its quirks (somehow this starts to look like IE6). – Yodan Tauber Dec 07 '10 at 13:27

3 Answers

4

As discussed in the comments, when dealing with mixed content it is important to know whether white-space-only text nodes are being preserved or stripped.

Universal solution:

/a/text()[normalize-space()][1]

Meaning: the first text node child of the root element that is not white space only

Another possibility:

/a/text()[last()]

Meaning: the last text node child of the root element
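
To make the two expressions concrete, here is a minimal Python sketch. It is not a real XPath evaluation: the standard library has no full XPath engine, so the code walks the DOM with `xml.dom.minidom` and imitates each predicate by hand. The helper name `child_text_nodes` is made up for illustration, and the sketch assumes a parser that preserves white-space-only text nodes, as `minidom` does.

```python
from xml.dom import minidom

# The document from the question.
DOC = """<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick brown fox.
</a>"""

XML_WS = " \t\r\n"  # the characters normalize-space() treats as white space


def child_text_nodes(element):
    """Hypothetical helper: rough equivalent of text() on `element`."""
    return [n for n in element.childNodes if n.nodeType == n.TEXT_NODE]


root = minidom.parseString(DOC).documentElement
texts = child_text_nodes(root)

# Imitates /a/text()[normalize-space()][1]:
# keep only text nodes that are not white space only, then take the first.
non_blank = [n.data for n in texts if n.data.strip(XML_WS)]
first_non_blank = non_blank[0]

# Imitates /a/text()[last()]: the last text node child, whatever it contains.
last_text = texts[-1].data

print(len(texts))                        # number of child text nodes of <a>
print(repr(first_non_blank.strip(XML_WS)))
print(repr(last_text.strip(XML_WS)))
```

With this parser `<a>` has two child text nodes (the white space before `<b>` and the wanted text), so both approaches end up at the same node. The selected value still carries its surrounding newline and indentation, which is why the sketch strips it before printing.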

1

text() selects all child text nodes of the current node, so /a/text() is the way to go. Just remember that you may need to do some string manipulation on the results, because an XML document like this one:

<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick <!--comment--> brown fox.
</a>

will return two text nodes with visible content ("The quick" and "brown fox."), because the comment splits the text. Also, the text values will contain whitespace (e.g. the newline after </b> and before "The").
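
The string manipulation mentioned above could look like the following Python sketch. It uses `xml.dom.minidom` from the standard library rather than an XPath engine, so the list comprehension only imitates `/a/text()`. Note the assumption that the parser keeps white-space-only text nodes (`minidom` does), which means the raw list has one more entry than a .NET-style tree with stripping enabled would give.

```python
from xml.dom import minidom

# The document from this answer, with a comment inside the text.
DOC = """<a>
   <b>
      Hello
      <c/>
      World
   </b>
   The quick <!--comment--> brown fox.
</a>"""

root = minidom.parseString(DOC).documentElement

# Imitates /a/text(): the data of every text node child of <a>.
texts = [n.data for n in root.childNodes if n.nodeType == n.TEXT_NODE]

# Drop white-space-only nodes and collapse runs of white space in the rest,
# similar in spirit to XPath's normalize-space().
pieces = [" ".join(t.split()) for t in texts if t.strip()]

# Rejoin the fragments that the comment split apart.
joined = " ".join(pieces)

print(texts)
print(pieces)
print(joined)
```

Here `texts` has three entries (the white space before `<b>`, the text before the comment, and the text after it), `pieces` keeps the two meaningful fragments, and `joined` puts the sentence back together.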

Yodan Tauber
0

You can start with /a/text(). This will get you just the text nodes, not the tags.

mariana soffer