I'm implementing a tool that needs to crawl a website. I use Anemone to crawl, and on each of Anemone's pages I use Boilerpipe and Nokogiri to extract and process the HTML.
My problem: when the server returns a 500 Internal Server Error, Nokogiri fails because there is no page to parse.
Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    # Skip pages that failed to download or were not found
    next if page.nil? || page.not_found?

    result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
    doc = Nokogiri::HTML.parse(result)
  end
end
In the case above, if the server returns a 500 Internal Server Error, the application raises an error in Nokogiri::HTML.parse(). I want to avoid this: if the server returns an error, I want the crawl to continue and simply skip that page.
Is there any way to handle 500 Internal Server Error and 404 Not Found with these tools?
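For reference, this is the kind of guard I have in mind: a small helper (the name `processable?` is mine, not from any library) that decides whether a response is worth parsing, assuming Anemone's `page.code` exposes the HTTP status code and `page.body` the raw response body:

```ruby
# Decide whether a crawled page is worth passing to Boilerpipe/Nokogiri.
# code: integer HTTP status (e.g. Anemone's page.code)
# body: response body string, or nil if the download failed
def processable?(code, body)
  return false if body.nil? || body.empty?  # nothing to parse at all
  (200..299).include?(code)                 # skip 404, 500, etc.
end

processable?(200, "<html><body>ok</body></html>")  # => true
processable?(500, "Internal Server Error")         # => false
processable?(404, nil)                             # => false
```

Inside `on_every_page` I could then write `next unless processable?(page.code, page.body)` before calling Boilerpipe, but I'm not sure this is the idiomatic way to do it with these tools.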
Kind regards, Hugo