
I have some HTML:

<hr noshade>
<p><a href="#1">Some text here</a></p>
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is some description</span></p>
<hr noshade> <!-- so <hr noshade> is the delimiter for me -->
<p><a href="#2">Some more text here</a></p>
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is description for some more text</span></p>
<hr noshade>

While parsing with Nokogiri, I want to print the information in each set of tags separated by my own delimiter, <hr noshade>. So the first block should print the content of all "p" tags that lie between the first two <hr noshade> tags, and so on.

the Tin Man
Rohan Dalvi

1 Answer


I'm using the accepted answer to "XPath select all elements between two specific elements".

I only have a semi-satisfactory solution.

You can use this XPath expression:

.//hr[1][@noshade]
  /following-sibling::*[not(self::hr[@noshade])]
                       [count(preceding-sibling::hr[@noshade])=1]

for the first group between <hr noshade> 1 and 2,

then,

.//hr[2][@noshade]
  /following-sibling::*[not(self::hr[@noshade])]
                       [count(preceding-sibling::hr[@noshade])=2]

for the elements between <hr noshade> 2 and 3, etc.

What these expressions select:

  1. all following siblings of an <hr noshade>, specified by its position N
  2. that have exactly N <hr noshade> preceding siblings, i.e. that are positioned in the N'th group
  3. and that are not <hr noshade> themselves

Because this selects several elements between two <hr noshade> tags, you may have to loop over the results and extract the data from each sibling element.

Does anyone have a more generic solution?

Community
paul trmbrth
  • Thanks for your reply. Yes, it makes some sense to me. I am now trying to imagine a more generic solution because the HTML file is auto-generated by software, so I wouldn't know the number of `<hr noshade>`'s that it might generate.
    – Rohan Dalvi Sep 24 '13 at 18:00
  • So, I tried this: `path = '//hr[1][@noshade]/following-sibling::* [not(self::hr[@noshade])][count(preceding-sibling::hr[@noshade])=1]'` `xpath = doc.xpath(path)` But I get an error on it: "unexpected ']' after 'equal' (Nokogiri::CSS::SyntaxError)" – Rohan Dalvi Sep 24 '13 at 18:11
  • CSS::SyntaxError?? I haven't tested with Nokogiri, only with Python's `lxml.html` – paul trmbrth Sep 24 '13 at 18:18
  • For the generic solution, you may count `<hr noshade>` first, and then generate as many XPath queries as you need
    – paul trmbrth Sep 24 '13 at 18:22
  • what you tried is a valid XPath expression, I don't know why Nokogiri would want to interpret this as CSS selector – paul trmbrth Sep 24 '13 at 18:25
  • can you guide me with the XPath for counting the number of occurrences of the `<hr noshade>` tag?
    – Rohan Dalvi Sep 24 '13 at 19:06
  • `count(.//hr[@noshade])` should give you the number of `<hr noshade>`
    – paul trmbrth Sep 24 '13 at 19:46
  • so the hr[0] refers to the information between the first "hr noshade" tag and the second "hr noshade" tag, but it doesn't give me some information before the first "hr noshade" tag. – Rohan Dalvi Sep 25 '13 at 19:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/38059/discussion-between-paul-t-and-rohan-dalvi) – paul trmbrth Sep 25 '13 at 21:18