Parsing HTML from a defined start point to a defined end point?

Question

I have some HTML:

<hr noshade>
<p><a href="#1">Some text here</a></p>
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is some description</span></p>
<hr noshade> <!-- so <hr noshade> is the delimiter for me -->
<p><a href="#2">Some more text here</a></p>
<p style="margin-top:0pt;margin-bottom:0pt;line-height:120%;"><span style="color:#000000;font-weight:bold;">This is description for some more text</span></p>
<hr noshade>

While parsing with Nokogiri, I want to print the information in each set of tags separated by my own delimiter, <hr noshade>. So the first block should print the information from all "p" tags that lie between the first two <hr noshade> tags, and so on.

score 1 · Accepted Answer · edited May 23 '17 at 12:28

I'm using the accepted answer to "XPath select all elements between two specific elements".

I only have a semi-satisfactory solution.

You can use this XPath expression:

.//hr[1][@noshade]
  /following-sibling::*[not(self::hr[@noshade])]
                       [count(preceding-sibling::hr[@noshade])=1]

for the first group between <hr noshade> 1 and 2,

then,

.//hr[2][@noshade]
  /following-sibling::*[not(self::hr[@noshade])]
                       [count(preceding-sibling::hr[@noshade])=2]

for the elements between <hr noshade> 2 and 3, etc.

What these expressions select:

all following siblings of an <hr noshade>, specified by its position N
that have exactly N <hr noshade> preceding siblings, i.e. are positioned in the N-th group
and that are not <hr noshade> themselves

Since this selects several elements between two <hr noshade> tags, you may have to loop over the results and extract data from each sibling element.

Does anyone have a more generic solution?

answered Sep 24 '13 at 17:46

paul trmbrth


Thanks for your reply. Yes, it makes some sense to me. I am now trying to imagine a more generic solution because the HTML file is auto-generated by software, so I wouldn't know the number of <hr noshade>'s that it might generate. – Rohan Dalvi Sep 24 '13 at 18:00
So, I tried this: path = '//hr[1][@noshade]/following-sibling::* [not(self::hr[@noshade])][count(preceding-sibling::hr[@noshade])=1]' xpath = doc.xpath(path) But I get an error on it as, "unexpected ']' after 'equal' (Nokogiri::CSS::SyntaxError)" – Rohan Dalvi Sep 24 '13 at 18:11
CSS::SyntaxError?? I haven't tested with Nokogiri, only with Python's `lxml.html` – paul trmbrth Sep 24 '13 at 18:18
For the generic solution, you may count `<hr noshade>` first, and then generate as many XPath queries as you need – paul trmbrth Sep 24 '13 at 18:22
what you tried is a valid XPath expression, I don't know why Nokogiri would want to interpret this as CSS selector – paul trmbrth Sep 24 '13 at 18:25
can you guide me with the XPath for counting the number of occurrences of the "<hr noshade>" tag? – Rohan Dalvi Sep 24 '13 at 19:06
`count(.//hr[@noshade])` should give you the number of `<hr noshade>` – paul trmbrth Sep 24 '13 at 19:46
so the hr[0] refers to the information between the first "hr noshade" tag and the second "hr noshade" tag, but it doesn't give me some information before the first "hr noshade" tag. – Rohan Dalvi Sep 25 '13 at 19:41
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/38059/discussion-between-paul-t-and-rohan-dalvi) – paul trmbrth Sep 25 '13 at 21:18
