
I have a big HTML page, and I want to select certain nodes using XPath:

<html>
 ........
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
.......
</html>

I can select HTML after the <!-- begin content --> using:

"//comment()[. = ' begin content ']/following::*" 

Also I can select HTML before the <!-- end content --> using:

"//comment()[. = ' end content ']/preceding::*" 

But what XPath do I need to select all the HTML between the two comments?

the Tin Man
xecutioner

1 Answer


I would look for elements that are preceded by the first comment and followed by the second comment:

doc.xpath("//*[preceding::comment()[. = ' begin content ']]
              [following::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=>   <p>Some more elements</p>
#=> </div>
#=> <p>Some more elements</p>

Note that the above gives you every element in between, at any depth. This means that if you iterate through the returned nodes, nested content shows up more than once: the <p> containing "Some more elements" is returned on its own and again as part of its parent <div>.

I think you might actually want just the top-level nodes in between, i.e. the siblings of the comments. This can be done using the preceding-sibling and following-sibling axes instead.

doc.xpath("//*[preceding-sibling::comment()[. = ' begin content ']]
              [following-sibling::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=>   <p>Some more elements</p>
#=> </div>

Update - Including comments

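Using //* only returns element nodes, which excludes comments (and some other node types, such as text nodes). You can change * to node() to return everything. Note that the comment test is an exact string match, so the queries below expect markers written without padding, i.e. <!--begin content--> and <!--end content-->.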

puts doc.xpath("//node()[preceding-sibling::comment()[. = 'begin content']]
                        [following-sibling::comment()[. = 'end content']]")
#=> 
#=> <!--keywords1: first_keyword-->
#=> 
#=> <div>html</div>
#=> 

If you just want element nodes and comments (ie not everything), you can use the self axis:

doc.xpath("//node()[self::* or self::comment()]
                   [preceding-sibling::comment()[. = 'begin content']]
                   [following-sibling::comment()[. = 'end content']]")
#=> <!--keywords1: first_keyword-->
#=> <div>html</div>
Justin Ko
  • Very good answer as usual... :) The last part is probably the one the OP was looking for. – Arup Rakshit Sep 18 '13 at 14:28
  • For the case `hello! html` it only gives back `html`. Any insights into this? I need the comment with keywords returned as well.
    – xecutioner Sep 24 '13 at 10:22
  • The problem is that `//*` selects only _element_ nodes. This does not include comments and some other node types (see this other [question](http://stackoverflow.com/questions/5643323/get-xpath-for-all-the-nodes)). Answer updated. – Justin Ko Sep 24 '13 at 13:08
  • Thanks a lot @JustinKo for the update with comments. – xecutioner Dec 30 '13 at 08:17