Extracting and grouping all text nodes using Xpath 2.0

Question

I would like to extract all text from subnodes of a specific document, AND return a text array. I think it would be easier to show it in an example:

given document:

<root>
    <div>
        some text
        <p>some other text</p>
    </div>

    <div>
        another text
        <b>yet another text <em>even more</em></b>
        end of text
    </div>
</root>

I would like to construct an expression which returns TWO elements:

 [0] some text some other text
 [1] another text yet another text even more end of text

I have tried many expressions but I seem to be missing something here. It is easy to extract the divs alone (just //div), but how do I group them and join all the text() subnodes of every div separately?

score 1 · Accepted Answer · edited May 23 '17 at 10:34

text() is your friend here:

You have to do this in two steps.

//div

then, relative to each div:

.//text()

And then programmatically merge them.

XPath is a query language, much like CSS selectors, and cannot transform things. Its functions (such as normalize-space) are there to refine your selection, not to modify the input itself.

See: how to get the normalize-space() xpath function to work?
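
For illustration, the two-step approach described above can be sketched in Java using only the JAXP API that ships with the JDK (XPath 1.0). This is a minimal sketch under stated assumptions, not code from the original thread: the class name DivTextExtractor, the method extractDivTexts and the SAMPLE constant are hypothetical. It first selects the div elements as a node set, then evaluates normalize-space(.) against each div to merge that div's text into one string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here for illustration only.
final class DivTextExtractor {

    // The sample document from the question.
    static final String SAMPLE =
          "<root>\n"
        + "    <div>\n"
        + "        some text\n"
        + "        <p>some other text</p>\n"
        + "    </div>\n"
        + "\n"
        + "    <div>\n"
        + "        another text\n"
        + "        <b>yet another text <em>even more</em></b>\n"
        + "        end of text\n"
        + "    </div>\n"
        + "</root>";

    // Returns one whitespace-normalized string per div element.
    static List<String> extractDivTexts(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Step 1: select the div elements themselves as a node set.
            NodeList divs = (NodeList) xpath.evaluate(
                "//div", new InputSource(new StringReader(xml)), XPathConstants.NODESET);

            // Step 2: evaluate normalize-space(.) with each div as the context node.
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(xpath.evaluate("normalize-space(.)", divs.item(i)));
            }
            return texts;
        } catch (XPathExpressionException e) {
            // Wrapped so that callers do not need to handle a checked exception.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Prints:
        // some text some other text
        // another text yet another text even more end of text
        for (String text : extractDivTexts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

The merge happens in the Java loop rather than in a single expression, because the JAXP return types only offer a node set or a single string, not a sequence of strings.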

edited May 23 '17 at 10:34

Community


answered Jan 14 '12 at 14:31

greut


Nope. //div/text() will return more nodes, since the first div has at least 1 text node and the second has at least 2 text nodes. //div//text() will return even more nodes. The expression I'm looking for should concat all text() nodes in each div separately. Something like //div/concat(.//text()), but it does not work of course. – Pma Jan 14 '12 at 14:41
Also, I am using pure XPath in a Java application and I cannot postprocess using XSLT, therefore I'm looking for a pure XPath solution. – Pma Jan 14 '12 at 14:44
There are no pure XPath solutions here, I'm sorry. Think of XPath as CSS selectors… they are selectors only, not transformers. – greut Jan 14 '12 at 15:04

score 1 · Answer 2 · answered Jan 14 '12 at 14:43

With XPath 2.0 (and assuming your input is well-formed with some added </b>) you can use a path like /root/div/normalize-space() which gives you a sequence of two strings "some text some other text" and "another text yet another text even more end of text".

answered Jan 14 '12 at 14:43

Martin Honnen


I have tried using this expression in a test Java application with Saxon 9. Unfortunately, there is a problem setting the returnType of the evaluate() method. If I set the type to XPathConstants.STRING I only get the first String value, "some textsome other text", so I think the expression should work. But how do I mark the return type as a "String array"? A return type of XPathConstants.NODESET does not work, since we are dealing with strings rather than nodes... – Pma Jan 14 '12 at 14:59
`normalize-space()` is a function, not a selector. – greut Jan 14 '12 at 15:01
The problem with the return type is that you are using the JAXP API, which has never been extended for XPath 2.0, so it doesn't allow you to request a result comprising a sequence of strings. Use Saxon's s9api interface instead. – Michael Kay Jan 14 '12 at 15:36
greut, I know that `normalize-space` is a function, but the poster of the question mentioned XPath 2.0, and in XPath 2.0 you can use a function call in the last step of a path expression. I think my suggestion solves the problem as far as XPath 2.0 can (it returns a sequence of two strings, not an array of two strings). – Martin Honnen Jan 14 '12 at 17:25

score 0 · Answer 3 · answered Jan 14 '12 at 15:35

XPath cannot construct new nodes: for that you need XSLT or XQuery. So an expression can never return an element which is not present in your source document. However, with XPath 2.0 you can easily enough return two strings: except for minor whitespace details, you can get the required result from the expression //div/normalize-space(.)
