I'm trying to write a script to test for the existence of a web page; it would be nice if it could check without downloading the whole page.

This is my jumping-off point. I've seen multiple examples use httplib in the same way; however, every site I check simply returns False.

import httplib
from httplib import HTTP
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    h = HTTP(p[1])
    h.putrequest('HEAD', p[2])
    h.endheaders()
    return h.getreply()[0] == httplib.OK

if __name__=="__main__":
    print checkUrl("http://www.stackoverflow.com") # True
    print checkUrl("http://stackoverflow.com/notarealpage.html") # False

Any ideas?

Edit

Someone suggested this, but their post was deleted. Does urllib2 avoid downloading the whole page?

import urllib2

def check(some_url):
    try:
        urllib2.urlopen(some_url)
        return True
    except urllib2.URLError:
        return False
some1
  • The second example actually exists :) http://stackoverflow.com/notarealpage.html – Gabi Purcaru Jun 24 '11 at 17:16
  • No. There is an entity in the response, but the status code is clear: Not Found. It's a misconception to assume that a 404 cannot say anything (or has to have the default "boring" error message). It just means the resource you were looking for does not exist, and it turns out SO is well implemented so it gives a human-readable description for this (saying "Page Not Found"...). – Bruno Jun 24 '11 at 17:23
  • I feel guilty about repeating another user's answer, so you should check out [this question](http://stackoverflow.com/questions/3229607/checking-whether-a-link-is-dead-or-not-using-python-without-downloading-the-webpa). Just as a warning, this question might be marked as duplicate because it is so similar to others, even though the question is phrased slightly differently. – cwallenpoole Jun 24 '11 at 17:17
  • Be careful: some web servers (e.g. IIS in my case) do not support HEAD and can respond with e.g. a 401 instead of 200, but return 200 with a GET; in that case, the fastest option is to do a partial chunk download with requests' stream=True. It will do a proper GET without downloading the file. – Florent Thiery Feb 05 '21 at 14:54
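A minimal sketch of the stream=True approach described in the comment above (the timeout value and helper name here are illustrative assumptions; requests is a third-party library):

import requests

def url_ok(url):
    # stream=True makes the GET return as soon as the headers arrive;
    # the body is not downloaded unless .content or iter_content() is read.
    try:
        resp = requests.get(url, stream=True, timeout=5)
        ok = resp.status_code < 400
        resp.close()  # release the connection without reading the body
        return ok
    except requests.RequestException:
        return False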

4 Answers


How about this:

import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    conn = httplib.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400

if __name__ == '__main__':
    print checkUrl('http://www.stackoverflow.com') # True
    print checkUrl('http://stackoverflow.com/notarealpage.html') # False

This will send an HTTP HEAD request and return True if the response status code is < 400.

  • Notice that StackOverflow's root path returns a redirect (301), not a 200 OK.
Corey Goldberg
  • Had to make some changes for Python 3: import urllib.parse as urlparse and import httplib2. Instead of HTTPConnection, it was HTTPConnectionWithTimeout. Instead of urlparse, it was urlparse.urlparse. – Kabira K Jan 16 '18 at 20:49
  • An HTTP 401 or 403 could be returned but the URL may exist. – Raj May 02 '21 at 17:11
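A Python 3 sketch of the same HEAD check, using the standard library's http.client and urllib.parse rather than the httplib2 package the comment above mentions (an untested adaptation of this answer's code):

from http.client import HTTPConnection
from urllib.parse import urlparse

def check_url(url):
    p = urlparse(url)
    conn = HTTPConnection(p.netloc)
    conn.request('HEAD', p.path or '/')  # default to '/' when the URL has no path
    resp = conn.getresponse()
    conn.close()
    return resp.status < 400

if __name__ == '__main__':
    print(check_url('http://www.stackoverflow.com'))  # True
    print(check_url('http://stackoverflow.com/notarealpage.html'))  # False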

Using requests, this is as simple as:

import requests

ret = requests.head('http://www.example.com')
print(ret.status_code)

This just loads the website's header. To test whether this was successful, you can check the result's status_code, or use the raise_for_status method, which raises an exception if the request was not successful.
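For example, a sketch with raise_for_status (the URL is a placeholder):

import requests

try:
    ret = requests.head('http://www.example.com')
    ret.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    print('page exists')
except requests.HTTPError as err:
    print('check failed:', err)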

MaxNoe

How about this:

import requests

def url_check(url):
    """Boolean return - check to see if the site exists.

    This function takes a URL as input, requests only the site's
    head (not the full HTML), and then checks whether the response
    status code is less than 400. If it is, the function returns
    True; otherwise it returns False.
    """
    try:
        site_ping = requests.head(url)
        if site_ping.status_code < 400:
            # To view the returned status code: print(site_ping.status_code)
            return True
        else:
            return False
    except Exception:
        return False
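Usage, following the pattern of the other answers (the URL is a placeholder):

if __name__ == '__main__':
    print(url_check('http://www.example.com'))  # True if the server answers with a status code < 400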
Josh

You can try:

import urllib2

try:
    urllib2.urlopen(url='https://someURL')
except urllib2.URLError:
    print("page not found")
cat