
I'm implementing a tool that needs to crawl a website. I'm using anemone to crawl, and on each page anemone visits I'm using boilerpipe and Nokogiri to process the HTML.

My problem is: if I get a 500 Internal Server Error, Nokogiri fails because there is no page.

Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    # Skip pages that are missing or returned 404.
    # not_found? covers only 404, so a 500 page still reaches the parser.
    unless page.nil? || page.not_found?
      result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
      doc = Nokogiri::HTML.parse(result)
    end
  end
end

In the case above, if there is a 500 Internal Server Error, the application raises an error in Nokogiri::HTML.parse(). I want to avoid this: if the server returns an error, I want to skip that page and continue.

Is there any way to handle 500 Internal Server Error and 404 Page Not Found with these tools?

Kind regards, Hugo

Charles
Hugo Sousa
  • OK, I think I found the solution. Each anemone page has a field called code, which holds the status code of the server's response. – Hugo Sousa Sep 02 '13 at 20:57
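
A minimal sketch of the approach described in the comment above, assuming anemone's page object exposes the HTTP status through `page.code` (as the comment states) and that it may be nil when the fetch itself failed. The helper names `usable_page?` and `crawl_site` are hypothetical, and Boilerpipe and Nokogiri are assumed to be loaded already, as in the question.

```ruby
# Hypothetical helper: true only when the page exists and its status is 2xx.
# to_i turns a missing (nil) code into 0, so failed fetches are rejected too.
def usable_page?(page)
  return false if page.nil?
  page.code.to_i.between?(200, 299)
end

# Hypothetical wrapper around the crawl from the question.
# Assumes Boilerpipe and Nokogiri are already required by the caller.
def crawl_site(start_url)
  require 'anemone' # loaded lazily so the helper above works without the gem
  Anemone.crawl(start_url) do |anemone|
    anemone.on_every_page do |page|
      # Skip 404, 500 and any other non-2xx response instead of parsing it
      next unless usable_page?(page)
      result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
      doc = Nokogiri::HTML.parse(result)
      # ... work with doc here ...
    end
  end
end
```

Calling `crawl_site(name)` would then replace the original `Anemone.crawl(name)` block. Because the check reuses the status anemone already recorded, no second request is made for each page.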

2 Answers

require 'net/http'
require 'uri'

# Get the response for the link
res = Net::HTTP.get_response(URI.parse(url))

# Good codes are in the range 200-399 (success and redirects)
if res.code.to_i.between?(200, 399)
  # do something with the url
else
  # skip this url (next assumes the code runs inside a loop or block)
  next
end
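
As a rough illustration of where the `next`-style skip fits, here is the same status check wrapped in a method and applied to a list of URLs. The names `good_status?` and `reachable_urls` are hypothetical, and treating network or parse errors as "skip" is an assumption, not something the answer specifies.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true for 2xx and 3xx codes; accepts "200" or 200.
def good_status?(code)
  code.to_i.between?(200, 399)
end

# Hypothetical helper: keeps only the URLs whose response has a good status.
# Any exception (bad URI, connection failure, timeout) counts as "skip".
def reachable_urls(urls)
  urls.select do |url|
    begin
      res = Net::HTTP.get_response(URI.parse(url))
      good_status?(res.code)
    rescue StandardError
      false
    end
  end
end
```

Note that this issues its own request for each URL, separate from any request a crawler has already made.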
davegson
  • Thank you so much, TheChamp. I just found out that anemone's page has a field called code. – Hugo Sousa Sep 02 '13 at 20:58
  • OK, glad to hear it. If you want to, you can try my solution and tell me if it worked as well! – davegson Sep 02 '13 at 21:03
  • It works too, but there is a problem: when anemone performs its request it could receive a 500 Internal Server Error, and then when your code makes its own request it could get a good reply. – Hugo Sousa Sep 02 '13 at 23:43
  • That shouldn't be possible. Could you post a URL to such a site or explain the problem further? – davegson Sep 03 '13 at 07:46

I ran into a similar problem. The question and the answer are here:

How to handle 404 errors with Nokogiri

Community
Bala
  • OK, I think I found the solution. Each anemone page has a field called code, which holds the status code of the server's response. – Hugo Sousa Sep 02 '13 at 20:59