
Is there a Ruby gem, or Ruby-esque way to check a webpage for broken links without crawling the actual links and checking for 404's, etc. Basically, I want a solution that works offline, and I want to detect links that are obviously syntactically broken, not links that point to web pages that don't exist.

So for instance, if a link points to "http//stackoverflow.com", that's a syntactically broken link, and I want to detect that. However if a link points to "http://www.webpagedoesnotexistyet.com" and it returns a 404, I'm OK with not detecting that.

Henley
  • Sounds like you want to use regex. I'd check out this post: http://stackoverflow.com/questions/4716513/ruby-regular-expression-to-match-a-url – Jonathan Bender Oct 30 '13 at 16:50
  • What logic applies when relative links like `/tags` or `/users` are detected? – Anand Shah Oct 30 '13 at 16:53
  • You'd still be using regex, first to find all `a` tags, then checking the `href` to make sure that it's either a valid full URL or begins with a `/` and doesn't include any invalid characters before the close quote. See this post: http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link – Jonathan Bender Oct 30 '13 at 17:25
  • [Not a regex alone](http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex), but an XML parser to detect the `href` attribute of `<a>` tags, which then gets passed through a regex. – PinnyM Oct 30 '13 at 17:31
  • And of course, the obligatory link to [this post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)... – PinnyM Oct 30 '13 at 17:38

3 Answers


Use Nokogiri to parse the HTML and `URI.parse` to check each extracted link. `URI.parse` raises `URI::InvalidURIError` when it encounters what it considers an invalid URL.

Philip Hallstrom

You could use something like this, where `links` is an array of link strings (note that this makes a HEAD request to each link, so it needs network access):

require 'net/http'
require 'uri'

links.each do |link|
    begin
        url = URI.parse(link)
        req = Net::HTTP.new(url.host, url.port)
        # request_uri falls back to "/" when the URL has no path
        res = req.request_head(url.request_uri)

        if res.code == "200"
            puts "#{res.code} ok - #{link}"
        else
            puts "#{res.code} error - #{link}"
        end
    rescue StandardError => e
        puts "breaking for #{link}: #{e.message}"
    end
end

You can use `URI.regexp`. If a string matches it, it contains a valid URI.

require 'uri'

def valid_uri?(s)
  !!(s =~ URI.regexp)
end


valid_uri?('http//stackoverflow.com') # => false
valid_uri?('http://www.webpagedoesnotexistyet.com/') # => true
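One caveat: `URI.regexp` is unanchored, so a string that merely *contains* a URI (e.g. `"see http://stackoverflow.com"`) also matches. If the whole string must be a URI, anchor the pattern; a sketch, where `whole_string_uri?` is a hypothetical helper name:

```ruby
require 'uri'

# True only if the entire string is an http/https URI.
def whole_string_uri?(s)
  !!(s =~ /\A#{URI.regexp(%w[http https])}\z/)
end

whole_string_uri?('http://stackoverflow.com')     # => true
whole_string_uri?('see http://stackoverflow.com') # => false
whole_string_uri?('http//stackoverflow.com')      # => false
```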
Sergio Tulentsev