What you're trying to do is strongly discouraged, and "RegEx match open tags except XHTML self-contained tags" helps explain why. Use patterns only in the most trivial cases, where you own the generation of the code. If you don't own the generator, any change in the HTML can break your code, often in ways that are irreparable, especially late at night during a critical outage with your boss hounding you to get it running immediately.
Using Nokogiri, this will get you into the ballpark in a more robust and recommended way. This example only collects the h2 and following p nodes. Figuring out how to display them is left as an exercise.
require 'nokogiri'

html = <<EOT
<h2>heading 1</h2>
<p>content 1a<b>test</b></p>
<p>content 1b</p>

<h2>heading 2</h2>
<p>content 2a</p>
EOT

doc = Nokogiri::HTML.parse(html)

output = doc.search('h2').map { |h|
  paragraphs = []
  next_node = h.next_sibling

  # Walk the following siblings, collecting <p> nodes until something else appears.
  # Using `while` instead of `break` at the top keeps `map` from returning nil
  # when an h2 has no following sibling.
  while next_node
    case
    when next_node.text? && next_node.blank?
      # Skip whitespace-only text nodes between tags.
    when next_node.name == 'p'
      paragraphs << next_node
    else
      break
    end
    next_node = next_node.next_sibling
  end

  [h, paragraphs]
}
Which results in output containing an array of arrays containing the nodes:
# => [[#(Element:0x3ff4e4034be8 {
# name = "h2",
# children = [ #(Text "heading 1")]
# }),
# [#(Element:0x3ff4e4034b98 {
# name = "p",
# children = [
# #(Text "content 1a"),
# #(Element:0x3ff4e3807ccc {
# name = "b",
# children = [ #(Text "test")]
# })]
# }),
# #(Element:0x3ff4e4034ad0 {
# name = "p",
# children = [ #(Text "content 1b")]
# })]],
# [#(Element:0x3ff4e4034a6c {
# name = "h2",
# children = [ #(Text "heading 2")]
# }),
# [#(Element:0x3ff4e40349a4 {
# name = "p",
# children = [ #(Text "content 2a")]
# })]]]
The h2-collecting code also makes some assumptions about the format of the HTML, but it won't spit out garbage if the format changes. It assumes a format like:
<h2>
<p>
...
where h2 is always followed by p tags until some other tag occurs, including a subsequent h2.
This test:
when next_node.text? && next_node.blank?
is necessary because HTML doesn't require formatting, but when formatting is present, "TEXT" nodes containing only whitespace are inserted; they produce the indentation we expect from "pretty HTML". The parser and browser don't care whether that whitespace is there, except in preformatted text; only humans do. It'd actually be better not to have it, because it bloats the file and slows down its transmission. But people are finicky that way. In reality the HTML sample in the code looks more like:
<h2>heading 1</h2>\n<p>content 1a<b>test</b></p>\n<p>content 1b</p>\n\n<h2>heading 2</h2>\n<p>content 2a</p>\n
and the when statement is ignoring those "\n" nodes.