2

I'm working on a Ruby script that uses Nokogiri and CSS selectors. I'm trying to scrape some data from HTML that looks like this:

<h2>Title 1</h2>
(Part 1)
<h2>Title 2</h2>
(Part 2)
<h2>Title 3</h2>
(Part 3)

Is there a way to select only from Part 2 by specifying the text of the h2 elements that mark its start and end points?

The data of interest in Part 2 is a table with tr and td elements that don't have any class or id identifiers. The other parts also have tables I'm not interested in. Something like

page.css('table tr td')

on the entire page would select from all of those other tables in addition to the one I'm after, and I'd like to avoid that if at all possible.

the Tin Man
  • Welcome to SO! Please see "[ask]" and the linked pages and "[mcve](https://stackoverflow.com/help/minimal-reproducible-example)". If a co-worker handed you the question on a paper and stepped away, would you be able to understand it and answer it? If not, what would you want to know? That's the sort of information we need also otherwise we have to make a lot of assumptions and guesses. What did you try? Why didn't it work? – the Tin Man Oct 18 '19 at 05:48
  • CSS selectors are not really up to the job. Instead you'd be better off using XPath, which has a much richer set of tools for looking at embedded text and siblings. – the Tin Man Oct 18 '19 at 05:53

4 Answers

1

I'd probably use this as a first attempt:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
  <h2>Title 1</h2>
    (Part 1)
  <h2>Title 2</h2>
    <table>
      <tr><td>(Part 2)</td></tr>
    </table>
  <h2>Title 3</h2>
    (Part 3)
EOT

doc.css('h2')[1].next_element
  .to_html # => "<table>\n      <tr><td>(Part 2)</td></tr>\n    </table>"

Alternately, rather than use css('h2')[1], I could pass some of the task to the CSS selector:

doc.at('h2:nth-of-type(2)').next_element
  .to_html # => "<table>\n      <tr><td>(Part 2)</td></tr>\n    </table>"

Once you have the table, it's easy to grab data from it. There are lots of examples out there showing how to do it.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
0

According to "Is there a CSS selector for elements containing certain text?", I'm afraid there is no CSS selector that matches on an element's text. How about first extracting the "(Part 2)" markup, and then using Nokogiri to select the table elements inside it?

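require 'nokogiri'

text = "" # your HTML string, or content read from a file

# Lazily capture everything between the "Title 2" heading and the next <h2>.
# In Ruby, the /m option makes "." match newlines.
part2 = text[/<h2>Title 2<\/h2>\s+(.+?)<h2>/m, 1]

doc = Nokogiri::HTML(part2)

# continue selecting table elements from doc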

This assumes that (Part 2) does not contain any h2 tag; otherwise the regex needs to be different.

the Tin Man
sam
  • Don't use regular expressions to search HTML. It's really easy for the pattern to break or leak if the HTML changes. Instead use a parser. – the Tin Man Dec 08 '19 at 00:32
0

If you know that the tables will be static, and that the data you require will always be in the second table, you can do something like:

page.css('table')[1].css('tr')[3].css('td')

This gets the second table on the page, accesses the 4th row of that table, and returns all the td cells of that row.

I haven't tested this, but this would be the way I would do it if the table I require doesn't have a class or identifier.

NemyaNation
0

I'd probably use this as a first attempt:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
  <h2>Title 1</h2>
    (Part 1)
  <h2>Title 2</h2>
    <table>
      <tr><td>(Part 2)</td></tr>
    </table>
  <h2>Title 3</h2>
    (Part 3)
EOT

doc.css('h2')[1].next_element.to_html # => "<table>\n      <tr><td>(Part 2)</td></tr>\n    </table>"

Alternately, rather than use css('h2')[1], I could pass some of the task to the CSS selector:

doc.at('h2:nth-of-type(2)').next_element
  .to_html # => "<table>\n      <tr><td>(Part 2)</td></tr>\n    </table>"

next_element is the trick used to find the node following the current one. There are many "next" and "previous" methods so read up on them as they're very useful for this sort of situation.

Finally, to_html is used above to show us what Nokogiri returned in a more friendly output. You wouldn't use it unless it was necessary to output HTML.

the Tin Man