4

How can I determine whether anything exists at a given URL using Python? It can be an HTML page or a PDF file; it shouldn't matter. I've tried the solution on this page http://code.activestate.com/recipes/101276/ but it just returns 1 when it's a PDF file or anything else.

xzvkm

4 Answers

16

You need to check the HTTP response code. Python example:

from urllib2 import urlopen
code = urlopen("http://example.com/").code

A 4xx or 5xx code means you probably cannot get anything from this URL. 4xx status codes describe client errors (like "404 Not Found") and 5xx status codes describe server errors (like "500 Internal Server Error"):

if (code / 100 >= 4):
   print "Nothing there."
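On Python 3, `urllib2` became `urllib.request`, and error statuses raise `HTTPError` rather than being returned. A rough sketch of the same check (function names are illustrative, not from the answer above):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def looks_ok(code):
    # 2xx and 3xx mean something is there; 4xx and 5xx mean it is not
    return code < 400

def url_exists(url):
    """True if the server answers the URL with a non-error status."""
    try:
        with urlopen(url) as response:
            return looks_ok(response.status)
    except HTTPError as err:
        return looks_ok(err.code)  # urllib raises for 4xx/5xx responses
    except URLError:
        return False               # DNS failure, refused connection, ...
```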


sastanin
  • 2
    `urlopen` sends a `GET` request, and the server will return the whole content for that URL. Personally I think use `HTTPConnection`/`HTTPSConnection` to build a `HEAD` request is better, which will save a lot net traffic. – iamamac Dec 27 '09 at 15:36
  • Good point. I agree that `HEAD` is often better and will save traffic. So I upvoted Yacoby's answer. However, `urllib2.urlopen` is radically easy to use and saves lines of code (no need to split URL into server/path pair, for instance). `GET` costs may be acceptable in many cases. – sastanin Dec 27 '09 at 16:14
  • You should probably handle redirection codes specially as well. – Fedir RYKHTIK Jun 28 '13 at 09:34
9

Send a HEAD request

import httplib
connection = httplib.HTTPConnection(host)  # host only, e.g. 'example.com' -- not the full URL
connection.request('HEAD', '/')
response = connection.getresponse()
if response.status == 200:
    print "Resource exists"
Yacoby
  • I get an error for this answer: Traceback (most recent call last): File "/home/cad/eclipse/wsp/pyCrawler/src/pyCrawler/pyCrawler.py", line 63, in print httpExists('http://sstatic.net/so/all.css?v=5912') File "/home/cad/eclipse/wsp/pyCrawler/src/pyCrawler/pyCrawler.py", line 25, in httpExists c = httplib.HTTPConnection(url) File "/usr/lib64/python2.6/httplib.py", line 656, in __init__ self._set_hostport(host, port) File "/usr/lib64/python2.6/httplib.py", line 668, in _set_hostport ..... – xzvkm Dec 27 '09 at 14:50
  • For the record, this one does use HTTP/1.1 and so does work on Slashdot. – Josh Lee Dec 27 '09 at 15:03
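On Python 3, `httplib` is `http.client`, and (as the traceback in the first comment shows) `HTTPConnection` wants a bare host, not a full URL. A hedged sketch that splits the URL first (function names are illustrative):

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlsplit

def split_for_connection(url):
    """HTTPConnection wants a bare host plus a path, not a full URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path or "/"

def head_status(url):
    """Return the status code of a HEAD request for the given URL."""
    scheme, host, path = split_for_connection(url)
    conn = (HTTPSConnection if scheme == "https" else HTTPConnection)(host)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()
```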
2

The httplib in that example is using HTTP/1.0 instead of 1.1, and as such Slashdot is returning a status code 301 instead of 200. I would recommend using urllib2, and also probably checking for codes 20* and 30*.

The documentation for httplib states:

It is normally not used directly — the module urllib uses it to handle URLs that use HTTP and HTTPS.

[...]

The HTTP class is retained only for backward compatibility with 1.5.2. It should not be used in new code. Refer to the online docstrings for usage.

So yes. urllib is the way to open URLs in Python — an HTTP/1.0 client won't get very far on modern web servers.

(Also, a PDF link works for me.)

Josh Lee
  • It has nothing to do with the HTTP version. It is actually because `httplib.HTTP` omitted the Host header from the request. You can test it yourself: run `telnet slashdot.org 80`, then send `HEAD / HTTP/1.0` followed by `Host: slashdot.org`. – iamamac Dec 27 '09 at 15:22
  • What I really meant is that a 1.0 request isn't *required* to include the host header. – Josh Lee Dec 27 '09 at 15:50
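The telnet experiment from the comments above can be scripted. A minimal Python 3 sketch (names are illustrative) that sends an HTTP/1.0 HEAD request with an explicit Host header over a raw socket:

```python
import socket

def build_head_request(host, path="/"):
    # HTTP/1.0 does not *require* the Host header, but name-based
    # virtual hosts (like slashdot.org above) need it to answer correctly
    return "HEAD %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)

def raw_head_status_line(host, path="/", port=80, timeout=10):
    """Send the request over a plain socket and return the status line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_head_request(host, path).encode("ascii"))
        return sock.makefile("rb").readline().decode("ascii").strip()
```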
0

This solution returns 1 because the server is sending a 200 OK response.

There's something wrong with your server. It should return 404 if the file doesn't exist.

myfreeweb