How can I parse and group the example HTML with Ruby?

HTML text:

<h2>heading one</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>

<h2>heading two</h2>
<p>different content in here <a>test</a> <b>test</b></p>

<h2>heading three</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>

Elements are not nested, and I want to group them by heading. When I find an <h2>, I want to extract its text and all the content that follows it, as is, until the next <h2>. The last heading has no following <h2> to act as a delimiter.

This is example output:

- Heading one
"<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>"

- Heading two
"<p>different content in here <a>test</a> <b>test</b></p>"
the Tin Man
Alex A

3 Answers

You can do it very quickly with Nokogiri without having to parse your HTML with regex.

You’ll be able to get the h2 elements then extract the content in them.

Some examples are at https://www.rubyguides.com/2012/01/parsing-html-in-ruby/

the Tin Man
cercxtrova

This should work. Group 1 contains the heading text and group 2 contains the body.

Surrounding whitespace is trimmed from both groups.

/<h2\s*>\s*([\S\s]*?)\s*<\/h2\s*>\s*([\S\s]*?)(?=\s*<h2\s*>|\s*$)/

https://regex101.com/r/pgLIi0/1

Readable regex

 <h2 \s* >
 \s*     
 ( [\S\s]*? )                  # (1) Heading
 \s* 
 </h2 \s* >
 \s*   
 ( [\S\s]*? )                  # (2) Body
 (?= \s* <h2 \s* > | \s* $ )
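
For use from Ruby, here is a hedged sketch of the same pattern applied with `String#scan`. One adjustment is needed: in Ruby, `$` matches at the end of every line rather than only at the end of the string, so the lazy body group would stop after the first line. The sketch therefore swaps `$` for `\z`. The method name `split_by_h2` is hypothetical.

```ruby
# Hypothetical helper: returns [[heading_text, body_html], ...] using the
# pattern above in free-spacing (x) form, with \z in place of $.
def split_by_h2(html)
  pattern = %r{
    <h2 \s* > \s*
    ( [\S\s]*? )                    # (1) heading text
    \s* </h2 \s* > \s*
    ( [\S\s]*? )                    # (2) body
    (?= \s* <h2 \s* > | \s* \z )    # stop before the next h2 or at end of string
  }x

  html.scan(pattern)
end

html = <<~HTML
  <h2>heading one</h2>
  <p>different content in here <a>test</a> <b>test</b></p>
  <p>different content in here <a>test</a> <b>test</b></p>

  <h2>heading two</h2>
  <p>different content in here <a>test</a> <b>test</b></p>
HTML

split_by_h2(html).each do |title, body|
  puts "- #{title}"
  puts body.inspect
end
```

Like the original pattern, this only recognizes `<h2>` tags without attributes.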

What you're trying to do is strongly discouraged and "RegEx match open tags except XHTML self-contained tags" helps explain why. Only in the most trivial cases where you own the generation of the code should you use patterns. If you don't own the generator, then any change in the HTML can break your code, often in ways that are irreparable, especially late at night during a critical outage with your boss hounding you to get it running immediately.

Using Nokogiri, this will get you into the ballpark in a more robust and recommended way. This example only collects the h2 and following p nodes. Figuring out how to display them is left as an exercise.

require 'nokogiri'

html = <<EOT
<h2>heading 1</h2>
<p>content 1a<b>test</b></p>
<p>content 1b</p>

<h2>heading 2</h2>
<p>content 2a</p>
EOT

doc = Nokogiri::HTML.parse(html)

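output = doc.search('h2').map { |h|

  # Walk the siblings that follow this heading. An h2 with no following
  # sibling simply gets an empty list.
  next_node = h.next_sibling
  paragraphs = []

  while next_node

    case
    when next_node.text? && next_node.blank?
      # Skip whitespace-only text nodes between tags.
    when next_node.name == 'p'
      paragraphs << next_node
    else
      # Any other node, including the next h2, ends this group.
      break
    end

    next_node = next_node.next_sibling

  end

  [h, paragraphs]
}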

Which results in output containing an array of arrays containing the nodes:

# => [[#(Element:0x3ff4e4034be8 {
#        name = "h2",
#        children = [ #(Text "heading 1")]
#        }),
#      [#(Element:0x3ff4e4034b98 {
#         name = "p",
#         children = [
#           #(Text "content 1a"),
#           #(Element:0x3ff4e3807ccc {
#             name = "b",
#             children = [ #(Text "test")]
#             })]
#         }),
#       #(Element:0x3ff4e4034ad0 {
#         name = "p",
#         children = [ #(Text "content 1b")]
#         })]],
#     [#(Element:0x3ff4e4034a6c {
#        name = "h2",
#        children = [ #(Text "heading 2")]
#        }),
#      [#(Element:0x3ff4e40349a4 {
#         name = "p",
#         children = [ #(Text "content 2a")]
#         })]]]

The code also makes some assumptions about the format of the HTML, but it won't spit out garbage if the format changes. It assumes a format like:

<h2>
<p>
...

where h2 is always followed by p tags until some other tag occurs, including a subsequent h2.

This test:

when next_node.text? && next_node.blank?

is necessary because HTML doesn't require formatting, but when it is formatted, the parser inserts text nodes that contain only whitespace, and those nodes are what produce the line breaks and indentation we expect from "pretty" HTML. The parser and browser don't care whether the whitespace is there, except in preformatted text; only humans do. It would actually be better not to have it, because it bloats the file and slows down transmission, but people are finicky that way. In reality, the HTML sample in the code looks more like:

<h2>heading 1</h2>\n<p>content 1a<b>test</b></p>\n<p>content 1b</p>\n\n<h2>heading 2</h2>\n<p>content 2a</p>\n

and the when statement is ignoring those "\n" nodes.

the Tin Man
  • The post you linked actually explains nothing. – Robert Harvey Jul 21 '19 at 01:37
  • In my reading of it it has some very good points about why regex fails, when it's possible to use it, and why parsers are more robust. Probably the link should point to the question itself, so I can tweak that, otherwise I find it a usable page with good discussions. – the Tin Man Jul 21 '19 at 01:39
  • FWIW, Nokogiri isn't a *standard;* it's just a library. It might be a very popular library, but popularity doesn't make something a "standard;" it just makes it popular. – Robert Harvey Jul 21 '19 at 01:42
  • I don't see the word "standard" anywhere on the page, except in your comment. Nokogiri is the most popular (de facto) parser for Ruby though, and it was the tag the OP used. I'm unsure what the comment was about. – the Tin Man Jul 21 '19 at 01:48
  • You made that assertion in one of your comments in that appalling (now deleted) conversation below the OP. – Robert Harvey Jul 21 '19 at 01:52
  • Ah, well, usually I say "de facto". – the Tin Man Jul 21 '19 at 01:54