How do I grep all the lines within the body tag using Ruby? I know this can be solved with Nokogiri, but I want to learn how to do it.

Example:

<body>
  <h1>Hello world</h1>
  <div>
    <button>Submit</button>
  </div>
</body>

From the above example, I want all the lines within the body tag, which are the h1, div, and button elements.

File path link: "#{Rails.root}/app/templates/example.html"

Abeid Ahmed
  • `<button>` is also within `<div>`. Are you looking for closest child elements only? – JDelorean Sep 08 '20 at 16:01
  • What does _"all the lines"_ mean? Do you want a multi-line string or the Nokogiri nodes or something else? Please be more specific. – Stefan Sep 08 '20 at 16:03
  • I want multi-line strings. – Abeid Ahmed Sep 08 '20 at 16:10
  • @JDelorean all the elements inside `body` tag. – Abeid Ahmed Sep 08 '20 at 16:11
  • Don't parse HTML with regular expressions unless you *really* know what you're doing and all the ways it can go wrong. In any non-trivial case, you need to *parse* HTML for your results to be reliable. – Todd A. Jacobs Sep 08 '20 at 16:44
  • @ToddA.Jacobs Are there any vulnerabilities? Because the file is from my own computer. – Abeid Ahmed Sep 08 '20 at 16:46
  • @AbeidAhmed It's not just about untrusted data; it's about [irregular data](https://stackoverflow.com/a/1732454/1301972). Unless it's part of a test fixture, you can pretty much count on regex solutions for HTML parsing to fail eventually, except in the most trivial of use cases. YMMV. – Todd A. Jacobs Sep 08 '20 at 16:52

2 Answers

Use XPath

You can collect the nodes within your body tag using XPath as follows:

require 'nokogiri'

html_fragment = <<~'EOF'
  <body>
    <h1>Hello world</h1>
    <div>
      <button>Submit</button>
    </div>
  </body>
EOF

fragment = Nokogiri::HTML.parse html_fragment
nodes    = fragment.xpath './/body/*'

After that, you can do whatever you like with the nodes to address your specific use case. Some examples include:

nodes.map &:text
#=> ["Hello world", "\nSubmit\n"]

nodes.map &:to_s
#=> ["<h1>Hello world</h1>", "<div>\n<button>Submit</button>\n</div>"]

nodes.to_html
#=> "<h1>Hello world</h1><div>\n<button>Submit</button>\n</div>"

nodes.inner_html
#=> "Hello world\n<button>Submit</button>\n"

Todd A. Jacobs

Your description isn't precise enough to understand exactly what you want.

str = <<~STR
<body>
  <h1>Hello world</h1>
  <div>
    <button>Submit</button>
  </div>
</body>
STR

str[%r{<body>(.*)</body>}m, 1]

kind of does what you describe, but it won't be reliable in all cases. Because `.*` is greedy, the capture starts at the first <body> and ends at the last </body> in the string, even if that </body> is inside an HTML comment. An example that would fail:

<body>
  <h1>Hello world</h1>
  <div>
    <button>Submit</button>
  </div>
</body>
<!-- </body> -->
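
To make the failure concrete, here is a minimal sketch using only Ruby's built-in regex support. The names `HTML_WITH_COMMENT`, `body_greedy`, and `body_lazy` are made up for illustration. It compares the greedy pattern above with a lazy `.*?` variant, which happens to handle this particular input but is still not a real parse.

```ruby
# Hypothetical fragment: a stray closing tag sits inside a comment
# after the real </body>.
HTML_WITH_COMMENT = <<~HTML
  <body>
    <h1>Hello world</h1>
  </body>
  <!-- </body> -->
HTML

# Greedy: .* backtracks from the end of the string, so the match
# ends at the LAST </body>, the one inside the comment.
def body_greedy(html)
  html[%r{<body>(.*)</body>}m, 1]
end

# Lazy: .*? stops at the FIRST </body> it can reach.
def body_lazy(html)
  html[%r{<body>(.*?)</body>}m, 1]
end

p body_greedy(HTML_WITH_COMMENT).include?('<!--') # capture overruns into the comment
p body_lazy(HTML_WITH_COMMENT).include?('<!--')   # capture stops at the real closing tag
```

The lazy version only moves the problem: if a commented-out </body> appeared before the real closing tag, it would cut the capture short instead.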
Kache
  • The `html` is inside a file and when I try to read the file like `File.read(path)[%r{<body>(.*)</body>}m, 1]`, it is returning `nil`. – Abeid Ahmed Sep 08 '20 at 16:34
  • It is working, there was a `class` attr in `body` tag. Thanks! – Abeid Ahmed Sep 08 '20 at 16:37
  • While not bulletproof, you can anchor the body tags so that they won't pick up stray comments from the fragment, e.g. `str[%r{^<body>\n(.*)\n</body>}m, 1]`. However, as whitespace can vary without changing the semantics of the fragment, it still can't be as reliable as an actual parse. – Todd A. Jacobs Sep 08 '20 at 16:56
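
The `nil` result mentioned in the comments comes from the literal `<body>` in the pattern, which cannot match an opening tag that carries attributes such as `class`. A rough sketch of a more tolerant pattern follows; `BODY_PATTERN` and `extract_body` are hypothetical names, and the approach assumes no attribute value contains a `>` character.

```ruby
# <body\b[^>]*> accepts an opening tag with or without attributes;
# the i flag also accepts <BODY>. This is still a regex, not an HTML parser.
BODY_PATTERN = %r{<body\b[^>]*>(.*?)</body>}mi

# Returns the text between the body tags, or nil when no body is found.
def extract_body(html)
  html[BODY_PATTERN, 1]
end

sample = <<~HTML
  <body class="main">
    <h1>Hello world</h1>
  </body>
HTML

puts extract_body(sample)
```

For a file on disk this would be called as `extract_body(File.read(path))`, with `path` pointing at the template. The same caveats about comments and irregular markup still apply.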