Scraping with Ruby and storing in a hash

Question

I wrote a Ruby scraper to grab campaign finance data from the California Senate and then save each individual as a hash. Here's the code so far:

Here's the main website: http://cal-access.sos.ca.gov/Campaign/Candidates/

Here's an example of a candidate page: http://cal-access.sos.ca.gov/Campaign/Committees/Detail.aspx?id=1342974&session=2011&view=received

And here's the GitHub repo in case you want to see my comments in the code: https://github.com/aboutaaron/Baugh-For-Senate-2012/blob/master/final-exam.rb

On to the code...

require 'nokogiri'
require 'open-uri'

campaign_data =  Nokogiri::HTML(open('http://cal-access.sos.ca.gov/Campaign/Candidates/'))

class Candidate
def initialize(url)
    @url = url
    @cal_access_url = "http://cal-access.sos.ca.gov"
    @nodes =  Nokogiri::HTML(open(@cal_access_url + @url))
end

def get_summary
    candidate_page = @nodes

    {
        :political_party => candidate_page.css('span.hdr15').text,
        :current_status => candidate_page.css('td tr:nth-child(2) td:nth-child(2) .txt7')[0].text,
        :last_report_date => candidate_page.css('td tr:nth-child(3) td:nth-child(2) .txt7')[0].text,
        :reporting_period => candidate_page.css('td tr:nth-child(4) td:nth-child(2) .txt7')[0].text,
        :contributions_this_period => candidate_page.css('td tr:nth-child(5) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
        :total_contributions_this_period => candidate_page.css('td tr:nth-child(6) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
        :expenditures_this_period => candidate_page.css('td tr:nth-child(7) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
        :total_expenditures_this_period => candidate_page.css('td tr:nth-child(8) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, ''),
        :ending_cash => candidate_page.css('td tr:nth-child(9) td:nth-child(2) .txt7')[0].text.gsub(/[$,](?=\d)/, '')
    }
end

def get_contributors
    contributions_received = @nodes
    grab_contributor_page = @nodes.css("a.sublink6")[0]['href']
    contributor_page = Nokogiri::HTML(open(@cal_access_url + grab_contributor_page))
    grab_contributions_page = contributor_page.css("a")[25]["href"]
    contributions_received = Nokogiri::HTML(open(@cal_access_url + grab_contributions_page))
    puts
    puts "#{@cal_access_url}" + "#{grab_contributions_page}"
    puts

    contributions_received.css("table").reduce([]) do |memo, contributors|
        begin

            memo << {
                :name_of_contributor => contributions_received.css("table:nth-child(57) tr:nth-child(2) td:nth-child(1) .txt7").text
            }

        rescue NoMethodError => e
            puts e.message
            puts "Error on #{contributors}"
        end
        memo
    end
end

end

campaign_data.css('a.sublink2').each do |candidates|
    puts "Just grabbed the page for " + candidates.text
    candidate = Candidate.new(candidates["href"])
    p candidate.get_summary
end

get_summary works as planned. get_contributors stores the first contributor <td> as planned, but does it 20-plus times. I'm only choosing to grab the name for now until I figure out the multiple printing issue.

The end goal is to have a hash of the contributors with all of their required information and possibly move them into a SQL database/Rails app. But first, I just want a working scraper.

Any advice or guidance? Sorry if the code isn't super. Super newbie to programming.

Try to trim the code down to just the part that illustrates the problem. — pguardiario, Jun 23 '12 at 01:21
It sounds like you're asking for a general code review instead of a specific problem? — Phrogz, Jun 23 '12 at 03:39
@Phrogz Not exactly. I'm specifically having issues with "get_contributors." When I run the method, the command line prints a hash of the value about 20 times. I am trying to figure it out. Sorry if my post wasn't clear. — aboutaaron, Jun 26 '12 at 23:03

Wayne Conrad · Accepted Answer · 2012-06-23T03:47:33.333

You're doing great. Good job on providing a stand-alone sample. You'd be surprised how many don't do that.

I see two problems.

The first is that not all pages have the statistics you're looking for. This causes your parsing routines to get a bit upset. To guard against that, you can put this in get_summary:

return nil if candidate_page.text =~ /has not electronically filed/i

The caller should then do something intelligent when it sees a nil.
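
As a rough illustration of what the caller might do with that nil, here is a minimal sketch that skips candidates with no summary and logs a warning. `StubCandidate` and `summaries_for` are hypothetical names invented so the example runs without network access; in the real script the objects would be `Candidate` instances.

```ruby
# Hypothetical stand-in for Candidate so the sketch runs without network access.
StubCandidate = Struct.new(:name, :summary) do
  def get_summary
    summary # nil when the page says "has not electronically filed"
  end
end

# Hypothetical helper: collects summaries keyed by candidate name and
# skips any candidate whose get_summary returned nil.
def summaries_for(candidates)
  candidates.each_with_object({}) do |candidate, result|
    summary = candidate.get_summary
    if summary.nil?
      warn "No electronic filing for #{candidate.name}, skipping"
      next
    end
    result[candidate.name] = summary
  end
end

candidates = [
  StubCandidate.new("Candidate A", { :ending_cash => "1000.00" }),
  StubCandidate.new("Candidate B", nil)
]
p summaries_for(candidates)
```

Skipping is only one option; recording the candidate with an explicit "no filing" marker may suit a later database import better.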

The other problem is that the server sometimes doesn't respond in a timely fashion, so the script times out. If you think the server is getting upset at the rate with which your script is making requests, you can try adding some sleeps to slow it down. Or, you could add a retry loop. Or, you could increase the amount of time it takes for your script to time out.
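
A retry loop with a pause between attempts could look something like the sketch below. `with_retries` and its parameters are hypothetical names, not part of open-uri or Nokogiri, and the flaky request is simulated so the example runs without touching the server.

```ruby
require 'timeout'

# Hypothetical helper: runs the block, retrying on the listed errors with a
# growing pause between attempts. Re-raises the last error once the
# attempts are used up.
def with_retries(attempts: 3, base_delay: 1, errors: [Timeout::Error, IOError, SystemCallError])
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors => e
    raise if tries >= attempts
    warn "Attempt #{tries} failed (#{e.class}), retrying"
    sleep(base_delay * tries) # wait a little longer after each failure
    retry
  end
end

# Demonstration with a simulated flaky request instead of a real HTTP call.
calls = 0
body = with_retries(attempts: 3, base_delay: 0) do
  calls += 1
  raise Timeout::Error, "simulated timeout" if calls < 3
  "<html>ok</html>"
end
puts "Succeeded after #{calls} calls"
```

In the scraper, the fetch in `initialize` could be wrapped the same way, for example `with_retries { Nokogiri::HTML(open(@cal_access_url + @url)) }`. To raise the timeout itself, open-uri accepts a `:read_timeout` option in current versions; check the documentation for the Ruby version you are running.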

There is also some duplication of logic in get_summary. This function might benefit from separating policy from implementation. The policy is what data to retrieve from the page, and how to format it:

FORMAT_MONEY = proc do |s|
  s.gsub(/[$,](?=\d)/, '')
end

FIELDS = [
  [:political_party, 'span.hdr15'],
  [:current_status, 'td tr:nth-child(2) td:nth-child(2) .txt7'],
  [:last_report_date, 'td tr:nth-child(3) td:nth-child(2) .txt7'],
  [:reporting_period, 'td tr:nth-child(4) td:nth-child(2) .txt7'],
  [:contributions_this_period, 'td tr:nth-child(5) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_contributions_this_period, 'td tr:nth-child(6) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:expenditures_this_period, 'td tr:nth-child(7) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:total_expenditures_this_period, 'td tr:nth-child(8) td:nth-child(2) .txt7', FORMAT_MONEY],
  [:ending_cash, 'td tr:nth-child(9) td:nth-child(2) .txt7', FORMAT_MONEY],
]

The implementation is how to apply that policy to the HTML page:

def get_summary
  candidate_page = @nodes
  return nil if candidate_page.text =~ /has not electronically filed/i
  keys_and_values = FIELDS.map do |key, css_selector, format|
    value = candidate_page.css(css_selector)[0].text
    value = format[value] if format
    [key, value]
  end
  Hash[keys_and_values]
end

Thanks @wayne-conrad. I am definitely going to clean up the code with your suggestions. Do you have any idea why `get_contributors` is printing a hash with multiple responses? — aboutaaron, Jun 28 '12 at 02:32
@aboutaaron Ah, that. Sorry I missed that. I didn't read your original question well enough, and the code in your question doesn't actually exercise that method. The trouble is that the CSS selector in `contributions_received.css("table")` is not specific enough. There are 40 or more tables in the page, so the selector is matching each one of them. — Wayne Conrad, Jun 28 '12 at 12:00
That's what I thought. I'm just stuck on how to get more specific and nail the content I want. If you see a way to do this, please let me know. Thanks! — aboutaaron, Jul 02 '12 at 02:32
@aboutaaron, does this help? http://stackoverflow.com/questions/2114695/extract-single-string-from-html-using-ruby-mechanize-and-nokogiri/2114744#2114744 — Wayne Conrad, Jul 02 '12 at 12:56
