
I have been practicing writing a number of Ruby scrapers using Mechanize and Nokogiri; for instance, here ( ). However, after making a certain number of requests (about 14,000 in this case) I get a connection timed out error:

/var/lib/gems/1.8/gems/net-http-persistent-2.5.1/lib/net/http/persistent/ssl_reuse.rb:90:in `initialize': Connection timed out - connect(2) (Errno::ETIMEDOUT)

I have Googled a lot, but the best answer I can find is that I am making too many requests to the server. Is there a way to fix this by throttling the requests, or by some other method?

  • See this thread regarding throttling: http://stackoverflow.com/questions/9241625/regulating-rate-limiting-ruby-mechanize. Also consider dropping back to version 1.0, which doesn't use persistent HTTP connections. – pguardiario Mar 10 '12 at 06:01 (a sketch of the sleep-based throttling approach is shown below)
  • Version 1.0 for Mechanize or Nokogiri? – ZenBalance Mar 16 '12 at 06:46
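One straightforward way to throttle, in the spirit of the comment above, is simply to sleep between requests. A minimal sketch, assuming Mechanize and a placeholder URL list (the one-second delay is arbitrary; tune it for the site you are scraping):

require 'mechanize'

agent = Mechanize.new
urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder list

urls.each do |url|
  page = agent.get(url)
  puts page.title
  sleep 1  # pause before the next request so the server isn't hammered
end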

1 Answer


After some more programming experience, I realized that this was a simple error on my part: my code did not catch the error that was thrown and move on to the next link when a link was corrupted.

For any novice Ruby programmers that encounter a similar problem:

The Connection timed out error is usually caused by an invalid link or a similar problem on the page being scraped.

You need to wrap the code that accesses each link in a begin/rescue block such as the one below:

begin
  # [1] your scraping code here
rescue
  # [2] code to move on to the next link/page/etc. that you are scraping
  #     instead of sticking to the invalid one
end

For instance, if you have a loop that iterates over links and extracts information from each one, that scraping code goes at [1], and the code to move on to the next link (consider using something like Ruby's next) goes at [2]. You might also consider printing something to the console to let the user know that a link was invalid. A concrete sketch follows below.
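Putting that together, here is a minimal sketch of such a loop; the link list and the exact exception classes rescued are assumptions for illustration, not the original code:

require 'mechanize'

agent = Mechanize.new
links = ['http://example.com/a', 'http://example.com/b']  # placeholder list

links.each do |url|
  begin
    page = agent.get(url)                 # [1] your scraping code here
    puts page.title
  rescue Timeout::Error, Errno::ETIMEDOUT, Mechanize::ResponseCodeError => e
    puts "Skipping #{url}: #{e.message}"  # let the user know the link was bad
    next                                  # [2] move on to the next link
  end
end

Rescuing specific exception classes like this is usually safer than a bare rescue, since it won't silently swallow unrelated bugs in your scraping code.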
