Parse HTML tree with nested loops using Nokogiri

Question

Hi, I'm new to Nokogiri and am trying to parse an HTML document with a varied tree structure. Any suggestions on how to go about parsing it would be great. I'd like to capture all the text on this page.

<div class = "main"> Title</div>
<div class = "subTopic">
    <span class = "highlight">Sub Topic</span>Stuff
</div>

<div class = "main"> Another Title</div>
<div class = "subTopic">
    <span class = "highlight">Sub Topic Title I</span>Stuff<br>
    <span class = "highlight">Sub Topic Title II</span>Stuff<br>
    <span class = "highlight">Sub Topic Title III</span>Stuff<br>
</div>

I tried this, but it just prints out each full array, and I'm not even sure how to get to the "Stuff" part.

content = Nokogiri::HTML(open(@url))
content.css('div.main').each do |m|
    puts m.text
    content.css('div.subTopic').each do |s|
        puts s.text
        content.css('span.highlight').each do |h|
            puts h.text
        end
    end
end

Help will be appreciated.

Is there a particular reason you are using nokogiri to do this? — dezman, Mar 14 '13 at 04:32
i'm doing it in Rails/Ruby. is there another tool you'd suggest? — haley, Mar 14 '13 at 04:35
Depending on your situation it might be best to do it client side with JS. — Web_Designer, Mar 14 '13 at 04:45
Oh yeah I'm saving to a database to use on other pages so need server side. — haley, Mar 14 '13 at 05:15

score 0 · Accepted Answer · answered Mar 14 '13 at 05:01

Something like this would parse your given document structure:

Data

<div class="main"> Title</div>
<div class="subTopic">
    <span class="highlight">Sub Topic</span>Stuff
</div>

<div class = "main"> Another Title</div>
<div class = "subTopic">
    <span class = "highlight">Sub Topic Title I</span>Stuff<br>
    <span class = "highlight">Sub Topic Title II</span>Stuff<br>
    <span class = "highlight">Sub Topic Title III</span>Stuff<br>
</div>

Code:

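require 'nokogiri'
require 'pp'

# Parse the saved HTML (the Data above, stored in text.txt).
content = Nokogiri::HTML(File.read('text.txt'))

# Build one hash per div.main, paired with the div.subTopic that follows it.
topics = content.css('div.main').map do |m|
  topic = {}
  topic['title'] = m.text.strip
  # [1] selects only the nearest following subTopic sibling.
  sub_topic = m.xpath("following-sibling::div[@class='subTopic'][1]")
  topic['highlights'] = sub_topic.css('span.highlight').map do |h|
    topic_highlight = {}
    topic_highlight['highlight'] = h.text.strip
    # The "Stuff" part is the first text node after each span.
    topic_highlight['text'] = h.xpath('following-sibling::text()[1]').text.strip
    topic_highlight # value of the inner block, collected by map
  end
  topic # value of the outer block, collected by map
end

pp topics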

Will print:

[{"title"=>"Title",
  "highlights"=>[{"highlight"=>"Sub Topic", "text"=>"Stuff"}]},
 {"title"=>"Another Title",
  "highlights"=>
   [{"highlight"=>"Sub Topic Title I", "text"=>"Stuff"},
    {"highlight"=>"Sub Topic Title II", "text"=>"Stuff"},
    {"highlight"=>"Sub Topic Title III", "text"=>"Stuff"}]}]

Thank you @Strelok! Really helpful. I got it to work, but .map is new to me. I tried researching it and got to Enumerable, but I still can't quite see why 'topic' and 'topic_highlight' are used at the end of their loops. I tried leaving them out and it appears they act like a counter. Is that right? If the answer is too long, could you point me to topics I can Google? Thanks again. — haley, Mar 14 '13 at 07:01
[What does the “map” method do in Ruby?](http://stackoverflow.com/questions/12084507/what-does-the-map-method-do-in-ruby) would answer your question about the `map` method. Every method in Ruby returns a value by default. This returned value will be the value of the last statement. So `topic` and `topic_highlight` are return values from the blocks. — Strelok, Mar 14 '13 at 22:44
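
To illustrate the point about block return values, here is a minimal sketch in plain Ruby (no Nokogiri needed). The `titles` array and the names `with_return` and `without_return` are made-up stand-ins for illustration, not part of the answer's code. It shows that `map` collects whatever the last expression in the block evaluates to, which is why the trailing `topic` matters.

```ruby
# Hypothetical sample data standing in for the text of the parsed nodes.
titles = [' Title', ' Another Title']

# The hash is the last expression in the block, so map collects hashes.
with_return = titles.map do |t|
  topic = {}
  topic['title'] = t.strip
  topic
end

# Here the last expression is the assignment, which evaluates to the
# assigned string, so map collects bare strings and the hashes are lost.
without_return = titles.map do |t|
  topic = {}
  topic['title'] = t.strip
end

p with_return    # an array of two hashes, one per title
p without_return # ["Title", "Another Title"]
```

So the trailing `topic` and `topic_highlight` lines are not counters. Each one is simply the value the block hands back to `map` for that element.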
