I wrote a script that will go through all of the customers in our database, verify that their website URL works, and try to find a Twitter link on their homepage. We have a little over 10,000 URLs to verify. After a fraction of the URLs have been verified, we start getting getaddrinfo errors for every URL.

Here's a copy of the code that scrapes a single URL:

def scrape_url(url) 
  url_found = false 
  twitter_name = nil 

  begin 
    agent = Mechanize.new do |a| 
      a.follow_meta_refresh = true 
    end 

    agent.get(normalize_url(url)) do |page| 
      url_found = true 
      twitter_name = find_twitter_name(page) 
    end 

    @err << "[#{@current_record}] SUCCESS\n" 
  rescue Exception => e 
    @err << "[#{@current_record}] ERROR (#{url}): " 
    @err << e.message 
    @err << "\n" 
  end 

  [url_found, twitter_name] 
end
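
(normalize_url and find_twitter_name are small helpers elsewhere in the script. Purely as an illustration of what they do — not the actual code — they're along these lines:)

def normalize_url(url)
  # Hypothetical: prepend a scheme if the stored URL doesn't have one
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

def find_twitter_name(page)
  # Hypothetical: grab the first twitter.com profile name linked from the page
  page.links.map { |l| l.href.to_s[%r{twitter\.com/([A-Za-z0-9_]+)}i, 1] }.compact.first
end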

Note: I've also run a version of this code that creates a single Mechanize instance that gets shared across all calls to scrape_url. It failed in exactly the same fashion.
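
For reference, the shared-instance variant was roughly this — one agent built once and reused for every call (a simplified sketch, using the same helpers and instance variables as above):

require 'mechanize'

SHARED_AGENT = Mechanize.new do |a|
  a.follow_meta_refresh = true
end

def scrape_url(url)
  url_found = false
  twitter_name = nil

  # The only change from the version above: reuse the one shared agent
  SHARED_AGENT.get(normalize_url(url)) do |page|
    url_found = true
    twitter_name = find_twitter_name(page)
  end

  [url_found, twitter_name]
rescue StandardError => e
  @err << "[#{@current_record}] ERROR (#{url}): #{e.message}\n"
  [url_found, twitter_name]
end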

When I run this on EC2, it gets through almost exactly 1,000 URLs, then returns this error for the remaining 9,000+:

getaddrinfo: Temporary failure in name resolution

Note, I've tried using both Amazon's DNS servers and Google's DNS servers, thinking it might be a legitimate DNS issue. I got exactly the same result in both cases.

Then, I tried running it on my local MacBook Pro. It only got through about 250 before returning this error for the remainder of the records:

getaddrinfo: nodename nor servname provided, or not known

Does anyone know how I can get the script to make it through all of the records?

EricM
  • Show us the url it's failing on. – pguardiario Nov 01 '12 at 22:24
  • It fails on around 9,000 of them. One example is http://www.agilecommerce.com. The URLs tend to work if plugged into a browser. – EricM Nov 01 '12 at 22:58
  • could you be running out of memory? – pguardiario Nov 02 '12 at 00:05
  • Try adding something to throttle your requests. I wouldn't be surprised if your DNS provider is getting upset and refusing your connections. – the Tin Man Nov 02 '12 at 00:10
  • You don't say what host OS you're running, but it looks like [Fedora had some problems that returned the same error](https://www.google.com/search?q=getaddrinfo%3A+Temporary+failure+in+name+resolution&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a). – the Tin Man Nov 02 '12 at 00:14
  • I might have found a potential solution. I set keep_alive to false and set a 1 second idle timeout. My theory is that Mechanize was keeping the connections open until they timed out. At some point, a maximum number of connections was hit and it couldn't make another to do a DNS lookup. Strictly a theory at this point, but I'm just shy of 3,000 records processed. – EricM Nov 02 '12 at 00:40

2 Answers

I found the solution. Mechanize was leaving connections open and relying on GC to clean them up. After a certain point, there were enough open connections that no additional outbound connection could be established to do a DNS lookup. Here's the code that fixed it:

agent = Mechanize.new do |a| 
  a.follow_meta_refresh = true
  a.keep_alive = false
end

With keep_alive set to false, each connection is closed and cleaned up immediately after the request completes.
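
In the comments above I also mentioned pairing this with a short idle timeout; assuming Mechanize's idle_timeout attribute, the combined setup looks like:

agent = Mechanize.new do |a|
  a.follow_meta_refresh = true
  a.keep_alive = false  # close each connection as soon as the request finishes
  a.idle_timeout = 1    # per the earlier comment: drop idle connections after one second
end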

EricM

See if this helps:

agent.history.max_size = 10

It will keep the page history from using too much memory.
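
If memory is the issue, the same cap can also be set when the agent is built — if I remember the API correctly, max_history on the agent is equivalent to history.max_size:

agent = Mechanize.new do |a|
  a.max_history = 10  # same effect as agent.history.max_size = 10
end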

pguardiario