I have some HTML pages where the content to be extracted is marked with HTML comments, like this:

<html>
 .....
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
...
</html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.

I want to extract the full elements between these two HTML comments:

<div>some text</div>
<div><p>Some more elements</p></div>

I can get the text-only version using this characters callback:

class TextExtractor < Nokogiri::XML::SAX::Document

  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end

  def comment(string)
    case string.strip        # strip leading and trailing whitespace
    when /^begin content/    # match starting comment
      @interesting = true
    when /^end content/      # match closing comment
      @interesting = false
    end
  end

  def characters(string)
    @text << string if @interesting
  end

end

I get the text-only version with @text but I need the full HTML stored in @html.

the Tin Man
xecutioner
  • Please show more of the HTML and the expected output. The question should be more explicit. – Arup Rakshit Sep 18 '13 at 10:01
  • I have re-edited the question. Basically I want to scrape the HTML between the two HTML comments <!-- begin content --> and <!-- end content -->. I can get the text-only version of the content, without the HTML tags, using the characters callback, but I have no idea how to store the HTML. – xecutioner Sep 18 '13 at 11:38
  • See also http://stackoverflow.com/questions/820066/nokogiri-select-content-between-element-a-and-b?rq=1 – Mark Thomas Sep 18 '13 at 19:59
  • possible duplicate of [Xpath to select between two html comments](http://stackoverflow.com/questions/18871618/xpath-to-select-between-two-html-comments) – Phrogz Sep 19 '13 at 03:10

1 Answer

Extracting content between two nodes is not a normal thing we'd do; normally we'd want the content inside a particular node. Comments are nodes too; they're just a special type of node.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT

By looking for a comment containing the specified text it's possible to find a starting node:

start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">

Once that's found, a loop stores the current node and then moves to the next sibling until it reaches another comment:

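content = Nokogiri::XML::NodeSet.new(doc)    # collects the nodes between the comments
contained_node = start_comment.next_sibling
loop do
  # stop at the next comment, or when there are no more siblings
  break if contained_node.nil? || contained_node.comment?
  content << contained_node
  contained_node = contained_node.next_sibling
end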

content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"
the Tin Man