
I want to select the following strings from this HTML using just lxml and some clever XPath. The strings will change, but the surrounding HTML will not.

I need...

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom
  • This description may contains <bold>html</bold> but i still need all of it!

from...

...
<p>
    <strong>Date:</strong> 19/11/2010<br>
    <strong>Ref:</strong> AAAAAA/01<br>
    <b>Type:</b> Normal<br>
    <b>Country:</b> United Kingdom<br>
</p>
<hr>
<p>
    <br>
    <b>1. Title:</b> The Title<br>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br>
    <b>3. Date:</b> 25th October<br>
...

</p>

...

So far I've only come up with using regular expressions and re:match to try to drag it out, but even that won't work without something that lets me get the innerHTML of the <p> nodes, for example.

Is there any way to do this without post-processing the string through regex?

Thanks :)

significance

2 Answers


Very ugly! With this properly well-formed input:

<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>

Simplest case:

/html/p/strong[.='Date:']/following-sibling::text()[1]

Evaluates to:

 19/11/2010

All of those in one:

/html/p/*[self::strong[.='Date:' or .='Ref:']|
          self::b[.='Type:' or .='Country:']]
         /following-sibling::text()[1]
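
A minimal sketch of running the two expressions above from lxml, assuming the well-formed input shown in this answer is parsed with `etree.fromstring`. The names `DOC`, `date`, `labelled` and `values` are illustrative only. The raw HTML from the question (with unclosed `<br>` tags) would need an HTML parser instead, and the paths would probably need adjusting.

```python
from lxml import etree

# Well-formed copy of the sample markup used in this answer.
DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# Simplest case: the first text node after the "Date:" label.
date = root.xpath("/html/p/strong[.='Date:']/following-sibling::text()[1]")

# All four labelled values in one expression.
labelled = root.xpath(
    "/html/p/*[self::strong[.='Date:' or .='Ref:']|"
    "self::b[.='Type:' or .='Country:']]"
    "/following-sibling::text()[1]"
)

# The text nodes keep their leading space, so strip them in Python.
values = [text.strip() for text in labelled]
print(values)
```

The stripping could also be done inside XPath with `normalize-space()`, but that function works on one node at a time, so a list comprehension is the simpler choice here.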

The complex one:

/html/p/node()[preceding-sibling::b[1][.='2. Description: ']]
              [following-sibling::b[1][.='3. Date:']]
              [not(self::br)]
  • Do you have any ideas about getting "This description may contains html but i still need all of it!" where that string could possibly contain any HTML? Guessing that one might be impossible...! – significance Nov 19 '10 at 18:14
  • @significance: My last expression selects all those nodes. –  Nov 19 '10 at 18:52

This isn't so difficult.

Given this XML document:

<html> 
<p> 
    <strong>Date:</strong> 19/11/2010<br/> 
    <strong>Ref:</strong> AAAAAA/01<br/> 
    <b>Type:</b> Normal<br/> 
    <b>Country:</b> United Kingdom<br/> 
</p> 
<hr/> 
<p> 
    <br/> 
    <b>1. Title:</b> The Title<br/> 
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/> 
    <b>3. Date:</b> 25th October<br/> 
</p> 
</html> 

I need...

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom

This XPath expression selects all of the above text nodes (together with the whitespace-only text nodes that sit between the elements):

/*/p[1]/text()

  • This description may contains html but i still need all of it!

Use this:

/*/p[2]/b[2]/following-sibling::node()
    [count(. | /*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node())
     =
     count(/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node())
    ]
Dimitre Novatchev
  • +1 This will be the answer for a fixed schema. A minor point: the "including" set could be shortened to `../b[3]/preceding-sibling::node()` –  Nov 19 '10 at 21:19
  • @Alejandro: What kind of "schema" are you talking about? For *this*? You must be joking. :) – Dimitre Novatchev Nov 20 '10 at 02:07