Updated Answer
Here's a general solution that creates a hierarchy of <section>
elements based on header levels and their following siblings:
class Nokogiri::XML::Node
# Create a hierarchy on a document based on heading levels
# wrap : e.g. "<section>" or "<div class='section'>"
# stops : array of tag names that stop all sections; use nil for none
# levels : array of tag names that control nesting, in order
def auto_section(wrap='<section>', stops=%w[hr], levels=%w[h1 h2 h3 h4 h5 h6])
levels = Hash[ levels.zip(0...levels.length) ]
stops = stops && Hash[ stops.product([true]) ]
stack = []
children.each do |node|
unless level = levels[node.name]
level = stops && stops[node.name] && -1
end
stack.pop while (top=stack.last) && top[:level]>=level if level
stack.last[:section].add_child(node) if stack.last
if level && level >=0
section = Nokogiri::XML.fragment(wrap).children[0]
node.replace(section); section << node
stack << { :section=>section, :level=>level }
end
end
end
end
Here is this code in use, and the result it gives.
The original HTML
<body>
<h1>Main Section 1</h1>
<p>Intro</p>
<h2>Subhead 1.1</h2>
<p>Meat</p><p>MOAR MEAT</p>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<h3>Caveats</h3>
<p>FYI</p>
<h4>ProTip</h4>
<p>Get it done</p>
<h2>Subhead 1.3</h2>
<p>Meat</p>
<h1>Main Section 2</h1>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<h4>Dive! Dive!</h4>
<p>...and down</p>
<hr /><p id="footer">Copyright © All Done</p>
</body>
The conversion code
# Use XML only so that we can pretty-print the results; HTML works fine, too
doc = Nokogiri::XML(html,&:noblanks) # stripping whitespace allows indentation
doc.at('body').auto_section # make the magic happen
puts doc.to_xhtml # show the result with indentation
The result
<body>
<section>
<h1>Main Section 1</h1>
<p>Intro</p>
<section>
<h2>Subhead 1.1</h2>
<p>Meat</p>
<p>MOAR MEAT</p>
</section>
<section>
<h2>Subhead 1.2</h2>
<p>Meat</p>
<section>
<h3>Caveats</h3>
<p>FYI</p>
<section>
<h4>ProTip</h4>
<p>Get it done</p>
</section>
</section>
</section>
<section>
<h2>Subhead 1.3</h2>
<p>Meat</p>
</section>
</section>
<section>
<h1>Main Section 2</h1>
<section>
<h3>Jumpin' in it!</h3>
<p>Level skip!</p>
</section>
<section>
<h2>Subhead 2.1</h2>
<p>Back up...</p>
<section>
<h4>Dive! Dive!</h4>
<p>...and down</p>
</section>
</section>
</section>
<hr />
<p id="footer">Copyright All Done</p>
</body>
Original Answer
Here's an answer using no XPath, but Nokogiri. I've taken the liberty of making the solution somewhat flexible, handling arbitrary start/stops (but not nested sections).
html = "<h2>Header</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>Second header</h2>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
<hr>
<p id='footer'>All done!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
# Provide a block that returns:
# true - for nodes that should start a new section
# false - for nodes that should not start a new section
# :stop - for nodes that should stop any current section but not start a new one
def group_under(name="section")
group = nil
element_children.each do |child|
case yield(child)
when false, nil
group << child if group
when :stop
group = nil
else
group = document.create_element(name)
child.replace(group)
group << child
end
end
end
end
doc = Nokogiri::HTML(html)
doc.at('body').group_under do |node|
if node.name == 'hr'
:stop
else
%w[h1 h2 h3 h4 h5 h6].include?(node.name)
end
end
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <section><h2>Header</h2>
#=> <p>First paragraph</p>
#=> <p>Second paragraph</p></section>
#=>
#=> <section><h2>Second header</h2>
#=> <p>Third paragraph</p>
#=> <p>Fourth paragraph</p></section>
#=>
#=> <hr>
#=> <p id="footer">All done!</p>
#=> </body></html>
For XPath, see XPath : select all following siblings until another sibling