60

How would I go about checking if a URL exists using Ruby?

For example, for the URL

https://google.com

the result should be truthy, but for the URLs

https://no.such.domain

or

https://stackoverflow.com/no/such/path

the result should be falsey

Wayne Conrad
Shrikanth Hathwar
  • 9
    question was good enough to match my google search and answers are valuable – kranzky Jan 27 '17 at 04:13
  • I agree. This question is useful. – Dessa Simpson Mar 24 '17 at 01:43
  • 1
    I think this is a good question with useful answers. The reason it was closed ("must demonstrate a minimal understanding") is no longer valid on SO. I've edited the question to add some examples. With that, I think the question can be reopened now. – Wayne Conrad Jul 08 '17 at 16:45
  • Please vote `reopen` if you think this question is good. 4 more people are required to reopen this question. I want to post an answer taking redirection into account. – ironsand Jul 08 '18 at 07:03
  • You should read this article : [Validating URL/URI in Ruby on Rails](http://www.igvita.com/2006/09/07/validating-url-in-ruby-on-rails/) – Sandro Munda May 06 '11 at 07:26

4 Answers

74

Use the Net::HTTP library.

require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)

At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:

do_something_with_it(url) if res.code == "200"

Note: To check an https URL, the use_ssl attribute must be set to true:

require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
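
The two snippets above differ only in the use_ssl line, so the scheme check can be folded into a small helper. The following is a minimal sketch, not part of the original answer: the names connection_settings and head_response are hypothetical. It uses URI::HTTP#request_uri instead of url.path, which keeps the query string and falls back to "/" when the path is empty.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: derive everything Net::HTTP needs from a URL string.
# Assumes the string is an http or https URL.
def connection_settings(url_string)
  url = URI.parse(url_string)
  {
    host: url.host,
    port: url.port,                  # 80 for http, 443 for https unless given explicitly
    use_ssl: url.scheme == "https",  # turn SSL on only for https URLs
    target: url.request_uri          # path plus query; "/" when the path is empty
  }
end

# Hypothetical helper: send a HEAD request and return the Net::HTTPResponse.
def head_response(url_string)
  settings = connection_settings(url_string)
  http = Net::HTTP.new(settings[:host], settings[:port])
  http.use_ssl = settings[:use_ssl]
  http.request_head(settings[:target])
end
```

With this in place, the check from the answer would read `do_something_with_it(url) if head_response(url).code == "200"`.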
Dhanu Gurung
Simone Carletti
  • On Production, for each and every URL this is returning me 200 code.. i have parsed ```http://www.http:/``` this URL and gave me 200 OK ...but which is wrong...What's the issue here? Any Idea? Note: This is working fine on Local Env. – Jay_Pandya Oct 06 '16 at 11:34
  • To also check the query part, as in e.g. YouTube urls, use `address = [url.path, url.query].compact.split('').flatten.join('?')` or, with Rails,`[url.path.presence || '/', url.query.presence].compact.join('?')` before doing `req.request_head(address)`. – Nic Nilov Jun 19 '18 at 08:59
62

Sorry for the late reply on this, but I think this deserves a better answer.

There are three ways to look at this question:

  1. Strict check if the URL exists
  2. Check if you are requesting the URL correctly
  3. Check if you can request it correctly and the server can answer it correctly
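
As a rough sketch, the three interpretations can be written as predicates on the status code alone. The method names below are hypothetical, and the sketch deliberately ignores redirects and unreachable servers, which the sections below deal with. Here code is the string returned by Net::HTTPResponse#code, such as "200".

```ruby
# 1. Strict existence: anything except 404 counts as "exists".
def exists_strictly?(code)
  code != "404"
end

# 2. Requested correctly: the server did not report a client error (4xx).
def requested_correctly?(code)
  !code.start_with?("4")
end

# 3. Answered correctly: neither a client error (4xx) nor a server error (5xx).
def answered_correctly?(code)
  !code.start_with?("4", "5")
end
```

For example, a "403" passes the first check but fails the other two, while a "500" passes the first two and fails only the third.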

1. Strict check if the URL exists

While 200 means that the server answers for that URL (thus, the URL exists), a different status code doesn't mean that the URL does not exist. For example, 302 (redirect) means that the URL exists and redirects to another one. While browsing, a 302 often behaves the same as a 200 from the user's point of view. Another status code that can be returned when a URL exists is 500 (internal server error). After all, if the URL did not exist, why would the application server process your request instead of simply returning 404 (not found)?

So there are actually only two cases in which a URL does not exist: when the server does not exist, or when the server exists but can't find the given URL path. Thus, the only way to check if the URL exists is to check that the server answers and that the return code is not 404. The following code does just that.

require "net/http"
def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # An empty path raises ArgumentError; String#present? is Rails-only, so use empty?
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  res.code != "404" # false if returns 404 - not found
rescue SocketError, Errno::ENOENT, Errno::ECONNREFUSED
  false # false if can't find or connect to the server
end

2. Check if you are requesting the URL correctly

However, most of the time we are not interested in whether a URL exists, but in whether we can access it. Fortunately, the HTTP status code families cover this: the 4xx family stands for client errors (that is, an error on your side: you are not requesting the page correctly, don't have permission, and so on). This is a good family of errors to check to see if you can access the page. From the wiki:

The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.

So the following code makes sure the URL exists and you can access it:

require "net/http"
def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # An empty path raises ArgumentError; String#present? is Rails-only, so use empty?
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
  else
    res.code[0] != "4" # false if http code starts with 4 - error on your side.
  end
rescue SocketError, Errno::ENOENT, Errno::ECONNREFUSED
  false # false if can't find or connect to the server
end

3. Check if you can request it correctly and the server can answer it correctly

Just as the 4xx family tells you whether you can access the URL, the 5xx family tells you whether the server had a problem answering your request. Errors in this family are most often due to problems on the server itself, and hopefully someone is working on solving them. If you need to be able to access the page and get a correct answer now, you should make sure the answer is not from the 4xx or 5xx family and, if you were redirected, that the redirected page answers correctly. So, much like (2), you can simply use the following code:

require "net/http"
def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # An empty path raises ArgumentError; String#present? is Rails-only, so use empty?
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection)
    url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
  else
    !%w(4 5).include?(res.code[0]) # Not from 4xx or 5xx families
  end
rescue SocketError, Errno::ENOENT, Errno::ECONNREFUSED
  false # false if can't find or connect to the server
end
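
The recursive versions above follow redirects without a limit and assume the Location header is always an absolute URL. The following is a hedged sketch of one way to tighten that; it is not the original author's code. The names url_reachable?, next_location and UNREACHABLE_ERRORS are hypothetical, the default limit of 5 redirects is an arbitrary assumption, and the list of rescued errors is borrowed from the comments on this answer rather than being exhaustive.

```ruby
require "net/http"
require "uri"
require "openssl"

# Errors treated here as "the URL cannot be reached" (assumed list, see lead-in).
UNREACHABLE_ERRORS = [
  SocketError, Errno::ECONNREFUSED, Errno::ENOENT,
  Net::OpenTimeout, Net::ReadTimeout,
  OpenSSL::SSL::SSLError, URI::InvalidURIError
].freeze

# Resolve a Location header against the URL that produced it,
# so relative redirects such as "/login" also work.
def next_location(current_url, location)
  URI.join(current_url, location).to_s
end

# True if the URL, after following at most `redirects_left` redirects,
# answers with a status outside the 4xx and 5xx families.
def url_reachable?(url_string, redirects_left = 5)
  return false if redirects_left < 0            # too many redirects

  url = URI.parse(url_string)
  return false unless url.is_a?(URI::HTTP)      # URI::HTTPS is a subclass of URI::HTTP

  http = Net::HTTP.new(url.host, url.port)
  http.use_ssl = (url.scheme == "https")
  res = http.request_head(url.request_uri)      # path plus query, "/" if the path is empty

  if res.is_a?(Net::HTTPRedirection) && res["location"]
    url_reachable?(next_location(url_string, res["location"]), redirects_left - 1)
  else
    !%w[4 5].include?(res.code[0])
  end
rescue *UNREACHABLE_ERRORS
  false
end
```

Whether a timeout or an SSL failure should count as "does not exist" is a design choice; here they are all mapped to false to keep the method's return value a plain boolean.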
TomG
fotanus
  • 2
    if you do this with https-urls you might get an `Net::HTTPBadResponse: wrong status line` error. This is because you have to tell Net:HTTP to use ssl. To make it work for https also, put a line `req.use_ssl = (url.scheme == 'https')` before calling `request_head` – Yo Ludke Jan 06 '14 at 08:29
  • @YoLudke Thank you for the contribution – fotanus Jan 06 '14 at 10:23
  • 1
    Another thing: If you request (or a redirect goes to) 'http://www.example.com' (without trailing '/'), then you get an `ArgumentError: HTTP request path is empty`. This can be addressed by changing the `res = req.request_head(url.path)` line to `path = url.path if url.path.present?` and `req.request_head(path || '/')` – Yo Ludke Jan 08 '14 at 08:49
  • @YoLudke True again, thanks! feel free to edit my answer. – fotanus Jan 08 '14 at 11:31
  • I made a gist with code that worked for me https://gist.github.com/tb/8787397 – tomaszbak Feb 03 '14 at 16:45
  • 6
    I had to add some more rescue to manage other cases: `rescue Errno::ENOENT false #false if can't find the server rescue URI::InvalidURIError false #false if URI is invalid rescue SocketError false #false if Failed to open TCP connection rescue Errno::ECONNREFUSED false #false if Failed to open TCP connection rescue Net::OpenTimeout false #false if execution expired rescue OpenSSL::SSL::SSLError false` – Camille Feb 15 '16 at 18:19
  • Just checking, but all this is perfectly safe to use with user input for the url_string, right? I mean even if it leads to a malicious site, this part of the code cannot cause any harm or security issues server-side, right? – Tashows Nov 02 '18 at 12:34
  • 1
    @Tashows it will only be unsafe if a malicious user can exploit URI.parse, which has no known vulnerabilities as far as I know. – fotanus Nov 04 '18 at 17:59
32

Net::HTTP works but if you can work outside stdlib, Faraday is better.

Faraday.head(the_url).status == 200

(200 is a success code, assuming that's what you meant by "exists".)

Dennis
Turadg
  • 8
    Why is it better in your opinion? – Dennis Jul 04 '14 at 17:53
  • 2
    You can also use the [RestClient library](https://github.com/rest-client/rest-client). `require 'rest_client'; RestClient.head(url).code != 404` – Dennis Jul 04 '14 at 18:35
  • If you want to check for just a general "success" then you can also use `.success?`. This will return `true` for any statuses from `200` to `299`, and `false` for all other statuses. https://github.com/lostisland/faraday/search?q=SuccessfulStatuses – Joshua Pinter Apr 14 '21 at 15:13
3

Simone's answer was very helpful to me.

Here is a version that handles redirects and returns the final URL (truthy) if it works, or false otherwise:

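require 'net/http'
require 'set'

def working_url?(url, max_redirects=6)
  response = nil
  seen = Set.new
  loop do
    url = URI.parse(url)
    break if seen.include? url.to_s      # redirect loop detected
    break if seen.size > max_redirects   # too many redirects
    seen.add(url.to_s)
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = (url.scheme == 'https')  # needed for https URLs
    # An empty path raises ArgumentError, so fall back to '/'
    response = http.request_head(url.path.empty? ? '/' : url.path)
    if response.kind_of?(Net::HTTPRedirection)
      url = response['location']
    else
      break
    end
  end
  # Final URL as a string on success, false otherwise
  response.kind_of?(Net::HTTPSuccess) && url.to_s
end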
Ryan Tate