5

I have an unordered list of links that I save off to the side, and I want to click each link and make sure it goes to a real page and doesn't 404, 500, etc.

The problem is that I don't know how to do it. Is there some object I can inspect that will give me the HTTP status code, or anything like that?

mylinks = Browser.ul(:id, 'my_ul_id').links

mylinks.each do |link|
  link.click

  # need to check for a 200 status or something here! how?

  Browser.back
end
puffpio
    Just for reference I recommend Xenu's Link Sleuth for this task: http://home.snafu.de/tilman/xenulink.html. I found it much easier and quicker to spider web pages this way. – kinofrost Apr 12 '11 at 12:36

4 Answers

5

My answer follows a similar idea to the Tin Man's.

require 'net/http'
require 'uri'

mylinks = Browser.ul(:id, 'my_ul_id').links

mylinks.each do |link|
  u = URI.parse link.href
  status_code = Net::HTTP.start(u.host, u.port) { |http| http.head(u.request_uri).code }
  # testing with rspec
  status_code.should == '200'
end

If you use Test::Unit as your testing framework, you can test it like the following, I think:

  assert_equal '200', status_code
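
For completeness, here is a fuller sketch of how that assertion could sit inside a test case. The test class name is made up, and it assumes the Watir Browser from the question is already set up:

require 'test/unit'
require 'net/http'
require 'uri'

class LinkStatusTest < Test::Unit::TestCase   # hypothetical test class name
  def test_all_links_return_200
    mylinks = Browser.ul(:id, 'my_ul_id').links

    mylinks.each do |link|
      u = URI.parse link.href
      status_code = Net::HTTP.start(u.host, u.port) { |http| http.head(u.request_uri).code }
      assert_equal '200', status_code, "#{link.href} returned #{status_code}"
    end
  end
end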

Another sample (incorporating Chuck van der Linden's idea from the comments): check the status code and log the URLs whose status is not good.

require 'net/http'
require 'uri'

mylinks = Browser.ul(:id, 'my_ul_id').links

mylinks.each do |link|
  u = URI.parse link.href
  status_code = Net::HTTP.start(u.host, u.port) { |http| http.head(u.request_uri).code }
  unless status_code == '200'
    File.open('error_log.txt', 'a+') { |file| file.puts "#{link.href} is #{status_code}" }
  end
end
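
If some of the links are https, Net::HTTP also needs SSL switched on before the HEAD request. A minimal sketch of that variation (the URL is hypothetical):

require 'net/http'
require 'uri'

u = URI.parse('https://www.example.com/')   # hypothetical https link

http = Net::HTTP.new(u.host, u.port)
http.use_ssl = (u.scheme == 'https')        # enable SSL only for https URLs
status_code = http.start { |conn| conn.head(u.request_uri).code }
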
  • Personally I'd think checking for one of the limited number of 'ok' return codes (perhaps just 200) would be better than checking that it doesn't equal a small number out of a potentially large set of 'error' codes. For example, are you going to count a 401 as passing? What about a 410? If it was me I'd pass it on a 200, and if the return code is anything else, spit it (and the URL) out to an error-log file of some sort that can be reviewed by a human. – Chuck van der Linden Apr 12 '11 at 17:45
  • @chuck-van-der-linden I edited my answer including your suggestion :) – Yutaka Yamaguchi Apr 13 '11 at 05:26
4

There's no need to use Watir for this. An HTTP HEAD request will give you an idea whether the URL resolves, and it will be faster.

Ruby's Net::HTTP can do it, or you can use OpenURI.

Using OpenURI you can request a URI and get a page back. Because you don't really care what the page contains, you can throw that part away and just check whether you got something:

require 'open-uri'

if open('http://www.example.com').read.length > 0   # a non-empty body means we got a page
  puts "is"
else
  puts "isn't"
end

The upside is that OpenURI resolves HTTP redirects. The downside is that it returns full pages, so it can be slow.

Ruby's Net::HTTP can help somewhat, because it can use HTTP HEAD requests, which don't return the entire page, only a header. That by itself isn't enough to know whether the actual page is reachable because the HEAD response could redirect to a page that doesn't resolve, so you have to loop through the redirects until you either don't get a redirect, or you get an error. The Net::HTTP docs have an example to get you started:

require 'net/http'
require 'uri'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

print fetch('http://www.ruby-lang.org')

Again, that example retrieves full pages, which might slow you down. You can replace get_response with request_head, which returns a response just like get_response does but without the body, which should help.
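
Here's a minimal sketch of that substitution. Because request_head is an instance method, the request goes through Net::HTTP.start; https URLs would additionally need use_ssl:

require 'net/http'
require 'uri'

# Same redirect-following idea as the docs example, but using HEAD so no body is transferred.
def fetch_head(uri_str, limit = 10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  uri = URI.parse(uri_str)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.request_head(uri.request_uri)
  end

  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection then fetch_head(response['location'], limit - 1)
  else
    response.error!
  end
end

puts fetch_head('http://www.ruby-lang.org').code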

In either case, there's another thing you have to consider. A lot of sites use "meta refreshes", which cause the browser to refresh the page, using an alternate URL, after parsing the page. Handling these requires requesting the page and parsing it, looking for the <meta http-equiv="refresh" content="5" /> tags.
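
A rough sketch of such a check, assuming Nokogiri is available for the HTML parsing:

require 'open-uri'
require 'nokogiri'   # assumption: Nokogiri is used here for parsing

html = Nokogiri::HTML(open('http://www.example.com'))
refresh = html.at('meta[http-equiv="refresh"]')

if refresh
  # The content attribute looks like "5" or "5; url=http://example.com/other".
  target = refresh['content'][/url=(.+)/i, 1]
  puts "Meta refresh points to: #{target}" if target
end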

Other HTTP gems like Typhoeus and Patron can also do HEAD requests easily, so take a look at them too. In particular, Typhoeus can handle some heavy loads via its companion Hydra, allowing you to easily use parallel requests.


EDIT:

require 'typhoeus'

response = Typhoeus::Request.head("http://www.example.com")
response.code # => 302

case response.code
when (200 .. 299)
  #
when (300 .. 399)
  headers = Hash[*response.headers.split(/[\r\n]+/).map{ |h| h.split(' ', 2) }.flatten]
  puts "Redirected to: #{ headers['Location:'] }"
when (400 .. 499)
  #
when (500 .. 599) 
  #
end
# >> Redirected to: http://www.iana.org/domains/example/

Just in case you haven't played with one, here's what the response looks like. It's useful for exactly the sort of situation you're looking at:

(rdb:1) pp response
#<Typhoeus::Response:0x00000100ac3f68
 @app_connect_time=0.0,
 @body="",
 @code=302,
 @connect_time=0.055054,
 @curl_error_message="No error",
 @curl_return_code=0,
 @effective_url="http://www.example.com",
 @headers=
  "HTTP/1.0 302 Found\r\nLocation: http://www.iana.org/domains/example/\r\nServer: BigIP\r\nConnection: Keep-Alive\r\nContent-Length: 0\r\n\r\n",
 @http_version=nil,
 @mock=false,
 @name_lookup_time=0.001436,
 @pretransfer_time=0.055058,
 @request=
  :method => :head,
    :url => http://www.example.com,
    :headers => {"User-Agent"=>"Typhoeus - http://github.com/dbalatero/typhoeus/tree/master"},
 @requested_http_method=nil,
 @requested_url=nil,
 @start_time=nil,
 @start_transfer_time=0.109741,
 @status_message=nil,
 @time=0.109822>

If you have a lot of URLs to check, see the Hydra example that is part of Typhoeus.
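
For reference, a rough sketch of what parallel HEAD checks with Hydra can look like (the URL list is hypothetical, and option names vary a bit between Typhoeus versions):

require 'typhoeus'

urls = ['http://www.example.com', 'http://www.ruby-lang.org']   # hypothetical list of links

hydra = Typhoeus::Hydra.new

urls.each do |url|
  request = Typhoeus::Request.new(url, :method => :head)
  request.on_complete do |response|
    puts "#{url} => #{response.code}"
  end
  hydra.queue(request)
end

hydra.run   # runs all queued requests in parallel, up to Hydra's concurrency limit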

the Tin Man
  • Thanks for the detailed reply! It's not that I am trying to use Watir to solve this specific problem... it's more that we use the Watir testing framework for front-end tests, and one thing we would like to test is that these dynamically generated links go to real endpoints... so I suppose I could use OpenURI within the Watir test framework. – puffpio Apr 12 '11 at 04:13
  • You ought to be able to; since Watir is all done in Ruby, anything running the Watir code generally can't tell the difference between generic Ruby, methods and classes from the Watir library, or ones from some other library. – Chuck van der Linden Apr 12 '11 at 17:39
2

There's a bit of a philosophical debate on whether Watir or watir-webdriver should provide HTTP return code information. The premise is that an ordinary 'user', which is what Watir simulates on the DOM, is ignorant of HTTP return codes. I don't necessarily agree with this, as my use case is perhaps slightly different from the main one (performance testing, etc.)... but it is what it is. This thread expresses some opinions about the distinction => http://groups.google.com/group/watir-general/browse_thread/thread/26486904e89340b7

At present there's no easy way to determine HTTP response codes from Watir without using supplementary tools like proxies/Fiddler/HTTPWatch/TCPdump, or dropping down to net/http-level scripting mid-test... I personally like using Firebug with the NetExport plugin to keep a retrospective view of tests.

Tim Koopmans
  • One of the things is that I don't necessarily need a status code... I just need to verify that these dynamically generated links go to real endpoints, or that the endpoints are not erroring out, etc. I thought the status code would be an easy thing to check. – puffpio Apr 12 '11 at 04:11
  • Unless you know what to expect at the endpoint of each link, and want to script specific watir based tests to look for that specific content on the page, I'd have to say that for your purpose, just looking at result codes would be the way to go for simple link checking. – Chuck van der Linden Apr 12 '11 at 17:40
0

All the previous solutions are inefficient if you have a very large number of links, because for each one they establish a new HTTP connection with the server hosting the link.

I have written a one-liner bash command that uses curl to fetch a list of links supplied on stdin and returns the status code corresponding to each link. The key point here is that curl takes the whole bunch of links in the same invocation and reuses HTTP connections, which dramatically improves speed.

However, curl will divide the list into chunks of 256, which is still far more than one! To make sure connections are reused, sort the links first (simply using the sort command).

cat <YOUR_LINKS_FILE_ONE_PER_LINE> | xargs curl --head --location -w '---HTTP_STATUS_CODE:%{http_code}\n\n' -s --retry 10 --globoff | grep HTTP_STATUS_CODE | cut -d: -f2 > <RESULTS_FILE>

It is worth noting that the above command will follow HTTP redirects, retry 10 times on temporary errors (timeouts or 5xx), and of course fetch only the headers.

Update: added --globoff so that curl won't expand any URL that contains {} or [].

hammady