Ruby grep all lines inside tag

Question

How can I grep all the lines within the body tag using Ruby? I know this can be solved with Nokogiri, but I want to learn how to do it.

Example:

<body>
  <h1>Hello world</h1>
  <div>
    <button>Submit</button>
  </div>
</body>

From the above example, I want all the lines within the body tag, that is, the h1, div, and button elements.

File path link: "#{Rails.root}/app/templates/example.html"

`<button>` is also within `<div>`. Are you looking for closest child elements only? – JDelorean, Sep 08 '20 at 16:01
What does _"all the lines"_ mean? Do you want a multi-line string or the Nokogiri nodes or something else? Please be more specific. – Stefan, Sep 08 '20 at 16:03
Don't parse HTML with regular expressions unless you *really* know what you're doing and all the ways it can go wrong. In any non-trivial case, you need to *parse* HTML for your results to be reliable. – Todd A. Jacobs, Sep 08 '20 at 16:44
@ToddA.Jacobs Are there any vulnerabilities? Because the file is from my own computer. – Abeid Ahmed, Sep 08 '20 at 16:46
@AbeidAhmed It's not just about untrusted data; it's about [irregular data](https://stackoverflow.com/a/1732454/1301972). Unless it's part of a test fixture, you can pretty much count on regex solutions for HTML parsing to fail eventually, except in the most trivial of use cases. YMMV. – Todd A. Jacobs, Sep 08 '20 at 16:52

score 2 · Answer 1 · answered Sep 08 '20 at 16:42

Use XPath

You can collect the nodes within your body tag using XPath as follows:

require 'nokogiri'

html_fragment = <<~'EOF'
  <body>
    <h1>Hello world</h1>
    <div>
      <button>Submit</button>
    </div>
  </body>
EOF

# Parse the markup as an HTML document
fragment = Nokogiri::HTML.parse html_fragment
# Select every element that is a direct child of <body>
nodes    = fragment.xpath './/body/*'

After that, you can do whatever you like with the nodes to address your specific use case. Some examples include:

nodes.map &:text
#=> ["Hello world", "\nSubmit\n"]

nodes.map &:to_s
#=> ["<h1>Hello world</h1>", "<div>\n<button>Submit</button>\n</div>"]

nodes.to_html
#=> "<h1>Hello world</h1><div>\n<button>Submit</button>\n</div>"

nodes.inner_html
#=> "Hello world\n<button>Submit</button>\n"

The HTML is inside a file, and when I try to read it with `File.read(path)[%r{<body>(.*)</body>}m, 1]`, it returns `nil`. – Abeid Ahmed Sep 08 '20 at 16:34
It is working now; there was a `class` attribute on the `body` tag. Thanks! – Abeid Ahmed Sep 08 '20 at 16:37
While not bulletproof, you can anchor the body tags so that they won't pick up stray comments from the fragment, e.g. `str[%r{^<body>\n(.*)\n</body>}m, 1]`. However, as whitespace can vary without changing the semantics of the fragment, it still can't be as reliable as an actual parse. – Todd A. Jacobs Sep 08 '20 at 16:56
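
For reference, the regex approach discussed in these comments might look roughly like the sketch below. It is not a real parse and will break on irregular markup. It only illustrates why a `class` attribute on the `body` tag made the literal `<body>` pattern return `nil`. The sample markup and the names `body_pattern`, `inner`, and `body_lines` are made up for illustration. In the question's setup the string would come from `File.read(path)` instead of a heredoc.

```ruby
# Sketch only: regex-based extraction, not a real HTML parse.
# The sample markup is hypothetical; in practice it would come from File.read(path).
html = <<~HTML
  <html>
  <body class="main">
    <h1>Hello world</h1>
    <div>
      <button>Submit</button>
    </div>
  </body>
  </html>
HTML

# <body\b[^>]*> tolerates attributes on the opening tag.
# The m flag lets . match newlines, and i ignores case.
# The lazy (.*?) stops at the first closing </body>.
body_pattern = %r{<body\b[^>]*>(.*?)</body>}mi

# String#[] with a regexp and a group index returns that capture, or nil on no match.
inner = html[body_pattern, 1]

# Split the captured text into lines and drop the blank ones.
body_lines = inner.to_s.lines.map(&:rstrip).reject(&:empty?)

puts body_lines
```

When the pattern does not match, `inner` is `nil` and `body_lines` ends up empty, so the caller can check for that case instead of hitting a `NoMethodError`.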
