
I am trying to scrape a web page with a lot of JavaScript. With the help of pguardiano, I have this piece of Ruby code:

 require 'rubygems'
 require 'watir-webdriver'
 require 'csv'
 @browser = Watir::Browser.new
 @browser.goto 'http://www.oddsportal.com/matches/soccer/'
 CSV.open('out.csv', 'w') do |out|
   @browser.trs(:class => /deactivate/).each do |tr|
     out << tr.tds.map(&:text)
   end
 end

The scraping runs repeatedly in the background, sleeping for approximately one hour between runs. I have no experience with Ruby, and in particular with web scraping, so I have a couple of questions.

  1. How can I avoid opening a new Firefox session on every run, which consumes a lot of CPU and RAM?

  2. Is it possible to use the Firefox engine without its GUI?

emanuele
  • see answer here http://stackoverflow.com/questions/5370762/how-to-hide-firefox-window-firefox-webdriver – peter Apr 07 '12 at 16:22

1 Answer


You can try a headless option.

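require 'watir-webdriver'
require 'headless'
# Start a virtual X display (the headless gem wraps Xvfb), so Firefox opens no visible window
headless = Headless.new
headless.start
b = Watir::Browser.start 'www.google.com'
puts b.title
b.close
# Shut down the virtual display
headless.destroy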

An alternative is to use the Selenium server. A third option is to use a scraper such as Kapow.
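
As for the first question (a new Firefox session on every run), one possible approach is to keep a single browser open and reload the page between runs instead of relaunching it. The following is only a sketch under assumptions: `dump_rows` and `scrape_repeatedly` are hypothetical helper names, and `browser` is expected to behave like the Watir browser from the question (`goto`, `refresh`, `trs`, `close`).

```ruby
require 'csv'

# Writes one CSV row per matching table row, using the same selector as
# the code in the question.
def dump_rows(browser, path)
  CSV.open(path, 'w') do |out|
    browser.trs(:class => /deactivate/).each do |tr|
      out << tr.tds.map(&:text)
    end
  end
end

# Hypothetical helper: performs `runs` scrapes with ONE browser session,
# refreshing the page between runs and sleeping `interval` seconds.
# The browser is closed at the end, even if a run raises.
def scrape_repeatedly(browser, url, path, runs:, interval:)
  browser.goto url
  runs.times do |i|
    browser.refresh unless i.zero?
    dump_rows(browser, path)
    sleep interval unless i == runs - 1
  end
ensure
  browser.close
end

# Usage (assumes watir-webdriver is installed and required):
#   scrape_repeatedly(Watir::Browser.new,
#                     'http://www.oddsportal.com/matches/soccer/',
#                     'out.csv', runs: 24, interval: 3600)
```

Note that each run overwrites the CSV file, as the original code does. Passing the browser in as an argument also means the same helper works whether or not the session was started under a headless display.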

Dave McNulla
  • I would think you might be better off with a lower-level solution, such as the HTTParty gem to make the request and get the response, and then Nokogiri to parse the HTML. Watir is more for functional TESTING of a website, and while it can be used for scraping, that is not its primary purpose, so it may not be an ideal solution. – Chuck van der Linden Apr 08 '12 at 23:41
  • I agree. If I wanted a cheap/easy scraper library, I would use Mechanize with Nokogiri. But that doesn't always work with JavaScript websites, as emanuele mentioned. Watir or Watir-Webdriver does. – Dave McNulla Apr 09 '12 at 07:01
  • Yeah, if there is a lot of client-side code, you need a real browser, or something very, very close to one. – Chuck van der Linden Apr 10 '12 at 07:51