Parsing / combining nested HTML element values in the original order

Question

I would like to know how to parse the content of an HTML block while preserving the order of the strings as they appear in the HTML document, using the Hpple wrapper, which works with XPath expressions. The environment is iOS.

Example:

<html>
<body>
<div>
Lorem ipsum <a href="...">dolor</a> sit <b>amet,</b> consectetur
</div>
</body>
</html>

Let's say we want to parse all the strings inside the <div> tag in the original order so that we get this result:

Lorem ipsum dolor sit amet, consectetur

The sticking point is preserving the order of the strings. It's easy to get the direct content of <div>, as well as that of <a> and <b>, separately or at the same time with an XPath expression. That approach, however, loses the order, so the content of <a> and <b> may end up at the end of the string.

How can you achieve this using an XPath expression with the mentioned wrapper?

Update:

One way to achieve this with the mentioned wrapper and platform (especially libxml2) seems to be the following XPath expression:

//div/descendant-or-self::*/text()

However, the resulting text nodes are returned separately rather than as one string, so they have to be concatenated manually.

Good question, +1. See my answer for a single XPath 1.0 expression that produces exactly the wanted text. — Dimitre Novatchev, Sep 08 '11 at 00:08

Dimitre Novatchev · Answer 1 · 2011-09-08T00:07:02.023

If Hpple is a compliant XPath engine, then it must be able to evaluate this expression:

string(/*/body/div)

This XPath expression evaluates to the string value of the first (in document order) /*/body/div element (in your case there is just one such element).

By definition, the string value of an element node is the concatenation of all of its descendant text nodes (in document order), and thus this result is exactly the string you requested.
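To make the definition concrete outside of any XPath engine, here is a minimal sketch in Python using only the standard library. It is not Hpple and not the asker's environment; the helper name `string_value` is made up for illustration. It simply walks the element's text in document order and joins it, which mirrors what string() is defined to return for an element.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
DOC = """<html>
<body>
<div>
Lorem ipsum <a href="...">dolor</a> sit <b>amet,</b> consectetur
</div>
</body>
</html>"""


def string_value(element):
    """Hypothetical helper: join all descendant text in document order.

    itertext() yields the element's own text, then each child's text
    and the text that follows that child, in document order.
    """
    return "".join(element.itertext())


root = ET.fromstring(DOC)
div = root.find("./body/div")  # the single div under body
text = string_value(div)

# Leading and trailing newlines inside the div are part of the value.
print(repr(text))
```

The text of <a> and <b> lands where those elements occur in the document, not at the end, because the traversal follows document order rather than grouping text by element.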

XSLT-based verification:

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:template match="/">
  <xsl:value-of select="/*/body/div"/>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>
    <body>
        <div> Lorem ipsum 
            <a href="...">dolor</a> sit 
            <b>amet,</b> consectetur 
        </div>
    </body>
</html>

produces the wanted, correct result:

 Lorem ipsum 
            dolor sit 
            amet, consectetur
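Note that the string value keeps the whitespace of the source, including the line breaks and indentation. In XPath 1.0, normalize-space(/*/body/div) would collapse it; whether Hpple can return the result of such a string function is an assumption that is not confirmed here. As a rough sketch of the same clean-up done after the fact, here is a Python standard-library version (the helper name `normalized_string_value` is hypothetical, and `str.split()` treats somewhat more characters as whitespace than XPath does):

```python
import xml.etree.ElementTree as ET

# The indented document used for the XSLT verification above.
INDENTED = """<html>
    <body>
        <div> Lorem ipsum 
            <a href="...">dolor</a> sit 
            <b>amet,</b> consectetur 
        </div>
    </body>
</html>"""


def normalized_string_value(element):
    """Hypothetical helper approximating normalize-space(string(element))."""
    # Concatenate all descendant text in document order.
    raw = "".join(element.itertext())
    # split() with no arguments drops leading/trailing whitespace and
    # treats every run of whitespace as one separator.
    return " ".join(raw.split())


div = ET.fromstring(INDENTED).find("./body/div")
result = normalized_string_value(div)
print(result)
```

For this input the result is the single line asked for in the question.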

Thanks for your great answer. I am sure it is correct for XPath in general, however I am not able to get it to work using Hpple. I have found another way to achieve this and I will shortly post it as well. Maybe someone else knows if there is any prefix etc. required in Hpple for recognizing string functions? — SnuggleUp, Sep 08 '11 at 03:30
