How can I select these elements from the following horrible HTML using XPath and lxml?

Question

I want to select the following strings from this HTML using just lxml and some clever XPath. The strings will change, but the surrounding HTML will not.

I need...

19/11/2010
AAAAAA/01
Normal
United Kingdom
This description may contains <bold>html</bold> but i still need all of it!

from...

...
<p>
    <strong>Date:</strong> 19/11/2010<br>
    <strong>Ref:</strong> AAAAAA/01<br>
    <b>Type:</b> Normal<br>
    <b>Country:</b> United Kingdom<br>
</p>
<hr>
<p>
    <br>
    <b>1. Title:</b> The Title<br>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br>
    <b>3. Date:</b> 25th October<br>
...

</p>

...

So far I've only come up with using regular expressions and re:match to try to drag it out, but even that won't work without something that lets me get the innerHTML of the <p> nodes, for example.

Is there any way to do this without post-processing the string with regex?

Thanks :)

Good question, +1. See my answer for concrete XPath expressions. :) — Dimitre Novatchev, Nov 19 '10 at 18:58

score 2 · Accepted Answer · answered Nov 19 '10 at 17:24

Very ugly! With this properly well-formed input:

<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>

Simplest case:

/html/p/strong[.='Date:']/following-sibling::text()[1]

Evaluates to:

 19/11/2010
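For reference, a minimal sketch of running that expression from lxml. It assumes the well-formed input above (trimmed here to the first paragraph) is held in a string; the names `DOC`, `result` and `date` are only illustrative.

```python
from lxml import etree

# Trimmed copy of the well-formed input (first <p> only).
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# xpath() returns a list; text nodes come back as str-like objects.
result = root.xpath("/html/p/strong[.='Date:']/following-sibling::text()[1]")

# The text node keeps its leading space, so strip it.
date = result[0].strip()
print(date)  # 19/11/2010
```

The leading space in the raw result is the one shown in the evaluated output above, so a `strip()` is usually wanted.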

All of those in one:

/html/p/*[self::strong[.='Date:' or .='Ref:']|
          self::b[.='Type:' or .='Country:']]
         /following-sibling::text()[1]
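A possible way to use the combined expression from lxml and pair each value with its label. This is a sketch, not part of the answer: it relies on lxml's "smart string" results, whose `getparent()` returns the element a tail text belongs to. `DOC`, `expr` and `fields` are illustrative names.

```python
from lxml import etree

# First paragraph of the well-formed input.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

expr = (
    "/html/p/*[self::strong[.='Date:' or .='Ref:'] | "
    "self::b[.='Type:' or .='Country:']]"
    "/following-sibling::text()[1]"
)

# Each result is the tail text of a <strong>/<b> label element, so
# getparent() gives that label element back.
fields = {t.getparent().text: t.strip() for t in root.xpath(expr)}
print(fields)
# {'Date:': '19/11/2010', 'Ref:': 'AAAAAA/01', 'Type:': 'Normal', 'Country:': 'United Kingdom'}
```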

The complex one:

/html/p/node()[preceding-sibling::b[1][.='2. Description: ']]
              [following-sibling::b[1][.='3. Date:']]
              [not(self::br)]

Do you have any ideas about getting "This description may contains html but i still need all of it!" where that string could possibly contain any HTML? Guessing that one might be impossible...! — significance, Nov 19 '10 at 18:14

Dimitre Novatchev · Answer 2 · 2010-11-19T19:03:04.897

This isn't so difficult.

Given this XML document:

<html> 
<p> 
    <strong>Date:</strong> 19/11/2010<br/> 
    <strong>Ref:</strong> AAAAAA/01<br/> 
    <b>Type:</b> Normal<br/> 
    <b>Country:</b> United Kingdom<br/> 
</p> 
<hr/> 
<p> 
    <br/> 
    <b>1. Title:</b> The Title<br/> 
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/> 
    <b>3. Date:</b> 25th October<br/> 
</p> 
</html>

i need...

19/11/2010

AAAAAA/01

Normal

United Kingdom

This XPath expression selects all of the above text nodes (together with the whitespace-only text nodes between the elements):

/*/p[1]/text()
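A short sketch of that expression in lxml. Because the paragraph also contains whitespace-only text nodes (the indentation around the elements), the example filters those out in Python; `DOC` and `values` are illustrative names.

```python
from lxml import etree

# First paragraph of the document above.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# Keep only the text nodes that contain something besides whitespace.
values = [t.strip() for t in root.xpath("/*/p[1]/text()") if t.strip()]
print(values)  # ['19/11/2010', 'AAAAAA/01', 'Normal', 'United Kingdom']
```

The same filtering could be done inside XPath with `/*/p[1]/text()[normalize-space()]`.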

This description may contains html but i still need all of it!

Use this:

/*/p[2]/b[2]/following-sibling::node()
                 [count(.|/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()) 
                = 
                  count((/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()))
                 ]

+1 This will be the answer for a fixed schema. A minor point: the "including" set could be shortened to `../b[3]/preceding-sibling::node()` — , Nov 19 '10 at 21:19
@Alejandro: What kind of "schema" are you talking about? For *this*? You must be joking. :) — Dimitre Novatchev, Nov 20 '10 at 02:07
