82

I wanted to check if a certain website exists; this is what I'm doing:

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com"
req = urllib2.Request(link, headers = headers)
page = urllib2.urlopen(req).read() - ERROR 402 generated here!

If the page doesn't exist (error 402, or whatever other errors), what can I do in the `page = ...` line to make sure that the page I'm reading does exist?

Blender
James Hallen
  • How 'bout an if check to only read if you get 200? – duffymo May 27 '13 at 18:09
  • Does this answer your question? [Python script to see if a web page exists without downloading the whole page?](https://stackoverflow.com/questions/6471275/python-script-to-see-if-a-web-page-exists-without-downloading-the-whole-page) – PhoneixS Apr 20 '22 at 09:43

10 Answers

152

You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.

For Python 2.7.x, you can use `httplib`:

import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '/')  # HEAD fetches only the headers, not the body
if c.getresponse().status == 200:
    print('web site exists')

or `urllib2`:

import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)

or, for 2.7 and 3.x, you can install `requests`:

import requests
response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist') 
Adem Öztaş
    Note that `www.abc.com` returns a 301 (Moved) [status code](http://www.w3.org/Protocols/HTTP/HTRESP.html). – unutbu May 27 '13 at 18:18
  • Can I pass a html link to `.HTTPConnection()` like `http:\\www.abc.com\x\y\z.html` – James Hallen May 27 '13 at 18:32
  • @JamesHallen you can use urllib2, – Adem Öztaş May 27 '13 at 18:42
  • 7
    Note that a HEAD request may fail even though the URL exists. Amazon, for example, returns status 405 (Method Not Allowed) for its front page. An additional GET may be needed in that case. – efotinis May 27 '13 at 19:44
  • 1
    This does not work in general. When I request a page that does not exists I get response code 200 and a page with the following content: `` I tried this with my vps host and the response is to redirect to gen.xyz. When scraping webpages I would like to handle this behavior in a dependable way. – kalu Aug 06 '14 at 04:52
  • 16
    I'm not sure what the old `requests` module is like but now, `requests.head` is the function to use instead of `requests.get`. – Moon Cheesez Jun 28 '16 at 15:34
  • 6
    @AdemÖztaş, using `requests` if particular website is not available then it throws `requests.exceptions.ConnectionError`. – Piyush S. Wanare Jan 05 '17 at 08:00
  • 3
    This answer is wrong. There are many other codes than 200 that sites return. Also this does not handle errors that come up going trough long lists of sites. – mikkokotila May 18 '18 at 10:33
  • @AdemÖztaş Thanks. What about single page applications that are very common in recent years? Do you know any good ways to check if a SPA url exists? – Jun Jan 17 '20 at 02:22
  • 1
    The requests.get() function returns a response, so naming the variable "response" instead of "request" would be more appropriate. – KarelHusa Aug 23 '21 at 11:08
51

It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):

  • 1xx - informational
  • 2xx - success
  • 3xx - redirection
  • 4xx - client error
  • 5xx - server error

If you want to check whether a page exists without downloading the whole page, you should use a HEAD request:

import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400

taken from this answer.

If you want to download the whole page, just make a normal request and check the status code. Example using requests:

import requests

response = requests.get('http://google.com')
assert response.status_code < 400


alecxe
  • I actually did want to download the page, but this was a preliminary step to see if the page existed – James Hallen May 27 '13 at 18:37
  • Is there anything wrong with parsing this link: `http://www.cmegroup.com/trading/energy/electricity/caiso-sp15-ez-gen-hub-5-mw-peak-calendar-month-day-ahead-lmp-swap-futures_contract_specifications.html` ? – James Hallen May 27 '13 at 18:54
  • The link you've provided has invalid character inside. The correct link is http://www.cmegroup.com/trading/energy/electricity/caiso-sp15-ez-gen-hub-5-mw-peak-calendar-month-day-ahead-lmp-swap-futures_contract_specifications.html. Just replace `http://google.com` with it in my examples and it'll work. – alecxe May 27 '13 at 18:58
  • Okay, thanks for that, please check the answer by `alexce` it works well too. – James Hallen May 27 '13 at 19:01
9
from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'

To answer the comment of unutbu:

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. Source
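
For instance (a small sketch in Python 2, using the redirecting host mentioned in the comments on the accepted answer):

import urllib2

# The default handlers follow the 301 from www.abc.com automatically,
# so getcode() reports the status of the final response, not the redirect.
response = urllib2.urlopen('http://www.abc.com/')
print response.getcode()  # e.g. 200 once the redirect has been followed
print response.geturl()   # the URL that actually served the page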

keas
8

There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved for the case where the resource is large.

The previous answer for requests suggested something like the following:

import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False

requests.get pulls the entire resource at once, so for large media files the snippet above would try to read the whole thing into memory. To solve this, we can stream the response.

def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False

I ran the above snippets with timers attached against two web resources:

1) http://bbb3d.renderfarming.net/download.html, a very light html page

2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file

Timing results below:

uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239

uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007

uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224

uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007

As a last note: this function also works when the resource host doesn't exist. For example, "http://abcdefghblahblah.com/test.mp4" will return False.
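
A quick usage sketch based on the claims above (the printed values assume the hosts behave as described at the time of writing):

print(uri_exists_stream("http://bbb3d.renderfarming.net/download.html"))  # True
print(uri_exists_stream("http://abcdefghblahblah.com/test.mp4"))          # False: the host never resolves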

Maxfield
7

I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also better for the web server, since it doesn't need to send the body back as well.

import requests

def check_url_exists(url: str):
    """
    Checks if a url exists
    :param url: url to check
    :return: True if the url exists, false otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200

The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
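
A quick usage sketch (hypothetical paths; the exact results depend on the server, and `allow_redirects=True` matters because many sites answer the first request with a redirect):

print(check_url_exists('http://www.example.com'))               # True when the host answers 200
print(check_url_exists('http://www.example.com/no-such-page'))  # False on a 404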

Gerardo Zinno
5

code:

a="http://www.example.com"
try:    
    print urllib.urlopen(a)
except:
    print a+"  site does not exist"
Raj
5

You can simply use the `stream` method to avoid downloading the full file. In the latest Python 3 there is no `urllib2`, so it's best to use the proven `requests` library. This simple function will solve your problem:

import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
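
One caveat worth noting: with `stream=True` the connection stays open until the body is consumed or the response is closed, so close it explicitly (or use a `with` block, as in the streaming answer above). A minimal usage sketch:

import requests

r = requests.get("http://www.example.com", stream=True)
print(r.status_code == 200)  # True if the page exists
r.close()  # the body was never read, so release the connection explicitly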
rusty
4
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
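
A short usage sketch (the function returns 1/0 rather than True/False; the second path is hypothetical):

print(isok('http://www.example.com'))          # 1: the page opened
print(isok('http://www.example.com/missing'))  # 0: urlopen raised HTTPError (404)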
  • 3
    Consider adding a description with your code; merely posting code does not help the community as it does not help them understand how it works. In order to attract upvotes from the community, consider adding some details of how your code works. – BusyProgrammer Mar 26 '17 at 17:49
  • 2
    I think more than one understood my code, but you're right. Thanks for the feedback! – DiegoPacheco Mar 27 '17 at 00:45
1

Try this one:

import urllib2

website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
Vishal Kumar
0

For those who want to check if a URL is accessible for POST requests but don't want to send any actual data to the API, I recommend the following approach:

import requests

url = 'https://www.example.com'

try:
    response = requests.options(url)
    if response.ok:   # alternatively you can use response.status_code == 200
        print("Success - API is accessible.")
    else:
        print(f"Failure - API is accessible but something is not right. Response code: {response.status_code}")
except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError) as e:
    print(f"Failure - Unable to establish connection: {e}.")
except Exception as e:
    print(f"Failure - Unknown error occurred: {e}.")

Using a GET request to check whether a POST endpoint exists would result in HTTP 405 (Method Not Allowed), which is a bit problematic, while using `requests.options()` returns HTTP 200.
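
If you also want to confirm that POST specifically is permitted, many (though not all) servers include an `Allow` header in the OPTIONS response listing the accepted methods; a hedged sketch:

import requests

response = requests.options('https://www.example.com')
# The Allow header, when present, lists the methods the endpoint accepts,
# e.g. "GET, POST, OPTIONS"; many servers omit it, so treat this as a hint.
allowed = response.headers.get('Allow', '')
print('POST' in allowed.upper())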

Kecz