I am using XPath to scrape a website (legitimately, for once!) thanks to the amazing powers of Visual Web Ripper.

One of the fields I need is the content of the P tags that follow an H3 tag. This is fine if I only want the first one, since I can use the following code:

//DIV[@id='content']/H3[. = 'Prices']/following-sibling::P[1]

But how can I select the content of all P tags up until the next H3?

Anthony Main
  • possible duplicate of [XPath : select all following siblings until another sibling](http://stackoverflow.com/questions/2161766/xpath-select-all-following-siblings-until-another-sibling) – glmxndr Apr 22 '11 at 07:49
  • Good question, +1. See my answer for a complete solution based on a general formula for node-set intersection. – Dimitre Novatchev Apr 22 '11 at 13:02
  • @tigermain - I am trying to do the same thing. How do you use the XPath from vw-ripper in PHP? – Imran Omar Bukhsh Apr 17 '12 at 14:02

3 Answers

Use:

//div[@id='content']/h3[. = 'Prices']
  /following-sibling::p
    [count
      (. |
       //div[@id='content']
         /h3[. = 'Prices']/following-sibling::h3[1]/preceding-sibling::p
      )
     =
     count
      (
       //div[@id='content']
         /h3[. = 'Prices']/following-sibling::h3[1]/preceding-sibling::p
      )
    ]

Here we use the Kayessian formula for the intersection of two node-sets $ns1 and $ns2:

$ns1[count(.|$ns2) = count($ns2)]
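
To see why the counting trick amounts to an intersection, here is a minimal Python sketch that mimics it with the standard library's ElementTree, which does not itself support the following-sibling axis or count(). The sample markup and the helper names (`intersect`, `paragraphs_under`) are assumptions made up for illustration; the real page is not shown in the question. It takes $ns2 relative to the nearest following h3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; the real site's markup is not shown in the question.
SAMPLE = """<div id="content">
  <h3>Intro</h3>
  <p>intro text</p>
  <h3>Prices</h3>
  <p>price A</p>
  <p>price B</p>
  <h3>Contact</h3>
  <p>contact text</p>
  <h3>Legal</h3>
  <p>legal text</p>
</div>"""


def intersect(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] using node identity."""
    ids2 = {id(n) for n in ns2}
    # A node of ns1 is kept when adding it to ns2 does not make ns2 larger,
    # which is only true if the node was already a member of ns2.
    return [n for n in ns1 if len(ids2 | {id(n)}) == len(ids2)]


def paragraphs_under(root, heading):
    children = list(root)
    # Raises StopIteration if no <h3> has this text.
    start = next(i for i, c in enumerate(children)
                 if c.tag == "h3" and c.text == heading)
    # ns1: h3/following-sibling::p
    ns1 = [c for c in children[start + 1:] if c.tag == "p"]
    # Index of the nearest following <h3>, or None if there is none.
    stop = next((i for i in range(start + 1, len(children))
                 if children[i].tag == "h3"), None)
    # ns2: every <p> sibling that precedes that nearest following <h3>
    ns2 = [] if stop is None else [c for c in children[:stop] if c.tag == "p"]
    return intersect(ns1, ns2)


root = ET.fromstring(SAMPLE)
result = [p.text for p in paragraphs_under(root, "Prices")]
print(result)  # ['price A', 'price B']
```

Note that when the chosen heading is the last h3, $ns2 is empty and nothing is selected, in the sketch as in the XPath expression, so the final section would need separate handling.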
Dimitre Novatchev

With Visual Web Ripper you can use the non-standard function SPAN, which includes all sibling nodes until the specified element is encountered.

Try:

//DIV[@id='content']/H3[. = 'Prices']/following-sibling::P[SPAN('H3')]

Thanks for your feedback and input, guys, but I found an even easier/quicker/tidier way of doing it (comments welcome):

//DIV[@id='content']/H3[. = 'Prices']/following-sibling::P[./preceding-sibling::H3[1][. = 'Prices']]
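
The idea behind this expression (keep each P whose nearest preceding H3 reads 'Prices') can be sketched in plain Python with the standard library. The sample markup and the function name `paragraphs_by_nearest_heading` are hypothetical; tag names are uppercase only to match the expression above. The sample deliberately repeats the 'Prices' heading to show what happens when the heading text is not unique.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a repeated 'Prices' heading.
SAMPLE = """<DIV id="content">
  <H3>Prices</H3>
  <P>price A</P>
  <P>price B</P>
  <H3>Contact</H3>
  <P>contact text</P>
  <H3>Prices</H3>
  <P>price C</P>
</DIV>"""


def paragraphs_by_nearest_heading(root, heading):
    """Collect every <P> whose closest preceding <H3> sibling has this text."""
    selected = []
    current = None  # text of the most recent <H3> seen so far
    for child in root:
        if child.tag == "H3":
            current = child.text
        elif child.tag == "P" and current == heading:
            selected.append(child)
    return selected


root = ET.fromstring(SAMPLE)
texts = [p.text for p in paragraphs_by_nearest_heading(root, "Prices")]
print(texts)  # ['price A', 'price B', 'price C']
```

Because the match is on heading text alone, paragraphs under both 'Prices' headings are returned, so this approach only isolates one section when the heading text is unique.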
John Saunders
Anthony Main
  • @tigerman: this is not a reliable and general solution. Here it is applicable only because the `H3` element is uniquely identified by its string value. Were there more than one `H3` element with the same string value, this solution might not select the desired nodes. At the same time, the solution that I provided always selects the expected nodes. You may benefit from this solution if you wish to learn. – Dimitre Novatchev Apr 23 '11 at 02:35
  • @tigerman: Also note that XPath (and XML) is case-sensitive, and in your question you are mixing cases (`p` and `P`), which makes the statements contained in the question false. It would be good if you corrected your question. I would recommend that you pay more attention to learning XPath and XML. – Dimitre Novatchev Apr 23 '11 at 02:39
  • I appreciate your concerns, but in my case this is not an issue. I am actually using the text values of the H3 elements as unique identifiers. – Anthony Main Apr 23 '11 at 09:42
  • @tigerman: In all such cases you must list your assumptions in the question itself. – Dimitre Novatchev Apr 23 '11 at 15:42