1

Here is the HTML code:

<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 1</a>

    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link4 inside specific text 2</a>

    <h2>Specific text 3</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 3</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 3</a>         

</div>  

I need to find the links under each "Specific text" heading separately. The problem is that if I write the following code in Python:

# root is the parsed document (e.g. an lxml element)
links = root.xpath("//div[@id='someid']//a")
for link in links:
    print(link.attrib['href'])

It prints ALL the links, regardless of which "Specific text x" heading they fall under, whereas I want something like:

print("link under Specific text:" + specific + " link:" + link.attrib['href'])

Please suggest how to do this.

jerrymouse
  • So, what is the exact output you want based on the provided XML document? This isn't clear. Please edit your question and add this requirement. – Dimitre Novatchev Aug 25 '11 at 12:32

2 Answers

1

I think you will need one XPath expression for each h2 specific text.

Given an h2 specific text, you can get its following adjacent a siblings by:

    //div[@id='someid']/h2[.='Specific text 1']
     /following-sibling::a[
      count( . | following-sibling::h2[1]/preceding-sibling::*)
      = count(following-sibling::h2[1]/preceding-sibling::*)
      and preceding-sibling::h2[1][.='Specific text 1']]
    |
    //div[@id='someid']/h2[.='Specific text 1' and not(following-sibling::h2[1])]
    /following-sibling::a

The second //h2 selection handles the case where h2 is the last one.
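A minimal sketch of how the expression above could be run once per heading from Python, assuming lxml is the library behind `root.xpath` in the question. The helper name `links_under`, the `PAGE` constant, and passing the heading text through an XPath variable (`$heading`) are illustrative choices, not part of the original answer.

```python
from lxml import html

# Sample markup copied from the question.
PAGE = """
<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 1</a>

    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link3 inside specific text 2</a>
    <a class="hyperlinks" href="link"> link4 inside specific text 2</a>

    <h2>Specific text 3</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 3</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 3</a>
</div>
"""

# The answer's expression, with the literal heading replaced by $heading.
EXPR = (
    "//div[@id='someid']/h2[. = $heading]"
    "/following-sibling::a["
    "count(. | following-sibling::h2[1]/preceding-sibling::*)"
    " = count(following-sibling::h2[1]/preceding-sibling::*)"
    " and preceding-sibling::h2[1][. = $heading]]"
    " | "
    "//div[@id='someid']/h2[. = $heading and not(following-sibling::h2[1])]"
    "/following-sibling::a"
)


def links_under(root, heading):
    """Return the <a> elements that sit under the <h2> with this exact text."""
    # lxml binds XPath variables from keyword arguments.
    return root.xpath(EXPR, heading=heading)


root = html.fromstring(PAGE)
for specific in ("Specific text 1", "Specific text 2", "Specific text 3"):
    for link in links_under(root, specific):
        print("link under Specific text:" + specific
              + " link:" + link.attrib["href"]
              + " text:" + link.text_content().strip())
```

Using a variable instead of string formatting avoids quoting problems if a heading ever contains an apostrophe.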

The expression above just exploits the XPath 1.0 intersection formula:

$ns1[count(.|$ns2)=count($ns2)]
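To see why `.|` appears inside `count()`: adding the current node to `$ns2` by union leaves the size of `$ns2` unchanged only when the node is already a member of `$ns2`. A small sketch of the formula in isolation, assuming lxml; the tiny document and the names `NS1`, `NS2`, and `INTERSECTION` are made up for illustration.

```python
from lxml import etree

# Hypothetical miniature of the question's structure.
doc = etree.fromstring(
    "<div>"
    "<h2>A</h2><a href='a1'/><a href='a2'/>"
    "<h2>B</h2><a href='b1'/>"
    "</div>"
)

NS1 = "/div/h2[.='A']/following-sibling::a"  # every <a> after heading A
NS2 = "/div/h2[.='B']/preceding-sibling::a"  # every <a> before heading B

# $ns1[count(. | $ns2) = count($ns2)] with both node-sets written inline.
INTERSECTION = "{ns1}[count(. | {ns2}) = count({ns2})]".format(ns1=NS1, ns2=NS2)

between = doc.xpath(INTERSECTION)
print([a.get("href") for a in between])  # ['a1', 'a2']

# Cross-check against a plain Python membership test on the two node lists.
ns1 = doc.xpath(NS1)
ns2 = doc.xpath(NS2)
expected = [a for a in ns1 if a in ns2]
print(between == expected)
```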

You can find plenty of resources on this method, including many answers here on SO (check my answers as well). Applying the formula is not difficult; the hard part is recognizing when it needs to be applied.

Credit for the formula goes to @Michael Kay. Just google it a bit.

My expression extends the formula with additional predicates to handle your specific case, and is combined by union (|) with a second expression to handle the last h2.

Emiliano Poggi
  • Thanks for the answer. The output is: `link1 inside specific text 1 link2 inside specific text 1 link3 inside specific text 1 link1 inside specific text 2 link2 inside specific text 2 link3 inside specific text 2 link4 inside specific text 2`. Hence it's also including links from specific text 2, whereas I want **only** the anchor text under specific text 1. – jerrymouse Aug 25 '11 at 11:36
  • Also, if possible, provide some learning links for xpath mentioning this kind of advanced methods. – jerrymouse Aug 25 '11 at 11:40
  • the output of: `links = root.xpath("//div[@id='someid']/h2[.='Specific text 1']/following-sibling::a[count( .| following-sibling::h2[1]/preceding-sibling::*) = count(following-sibling::h2[1]/preceding-sibling::*)] | //div[@id='someid']/h2[.='Specific text 1' and not(following-sibling::h2[1])] /following-sibling::a") for link in links: print link.xpath("string()")` – jerrymouse Aug 25 '11 at 11:42
  • I tweaked your code to get it working! Thanks a ton! However, I would be grateful if you can slightly explain the **working** of this code... Like why is `.|` used in count() etc – jerrymouse Aug 25 '11 at 11:53
  • I've run a test (sorry :P) and there was a flaw. It is now corrected in my answer. I also added some reference information. – Emiliano Poggi Aug 25 '11 at 12:03
  • Please accept/upvote the answer if you think it fulfills your question or was useful. – Emiliano Poggi Aug 25 '11 at 12:04
  • You can also find a recent entry [here](http://stackoverflow.com/questions/7178471/how-to-perform-set-operations-in-xpath-1-0). Cheers – Emiliano Poggi Aug 25 '11 at 13:02
0

You could use the starts-with(s, t) function to build a matching condition on the h2 value. It is part of XPath 1.0 as well as 2.0.

//div/h2[starts-with(text(), 'Specific text')]//a

I don't know of any XPath 2.0 implementation for Python, but since starts-with() does not require 2.0, that should not be the obstacle. This exact expression will probably not work as written, but perhaps you can adapt the condition to your needs.
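If relying on a particular XPath feature set is a concern, the per-heading grouping can also be sketched without XPath at all. This swaps in a different technique from the expression above: a plain loop over the children of the div, using Python's `str.startswith` in place of the XPath function. It assumes the markup is well-formed enough for the standard-library `xml.etree.ElementTree` parser; the function name `group_links` and the shortened `PAGE` sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the question's markup.
PAGE = """<div id="someid">
    <h2>Specific text 1</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 1</a>
    <a class="hyperlinks" href="link"> link2 inside specific text 1</a>
    <h2>Specific text 2</h2>
    <a class="hyperlinks" href="link"> link1 inside specific text 2</a>
</div>"""


def group_links(div, prefix="Specific text"):
    """Return a list of (heading text, [anchor elements]) in document order."""
    groups = []
    current = None
    for child in div:  # direct children only, in document order
        if child.tag == "h2":
            if (child.text or "").startswith(prefix):
                current = (child.text, [])
                groups.append(current)
            else:
                current = None  # non-matching heading: ignore its links
        elif child.tag == "a" and current is not None:
            current[1].append(child)
    return groups


div = ET.fromstring(PAGE)
for heading, anchors in group_links(div):
    for a in anchors:
        print("link under " + heading
              + " link:" + a.get("href")
              + " text:" + (a.text or "").strip())
```

Because the anchors are siblings of the headings rather than children, the loop tracks the most recent heading seen and attaches each anchor to it.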

  • Thanks for the reply. Can you please be a little more specific to the code here? – jerrymouse Aug 25 '11 at 10:55
  • Unfortunately that isn't working. The reason being that 'a' is not a child of h2, rather a sibling of h2. This method would have worked if all the anchors were inside h2 though. – jerrymouse Aug 25 '11 at 11:10
  • Sorry, got confused by your indenting :) –  Aug 25 '11 at 11:11
  • The **a** nodes are following siblings of the **h2** elements, not descendants. Your XPath will not select any **a** node currently. Also `starts-with()` is not the right approach. – Emiliano Poggi Aug 25 '11 at 11:12