I have a list of URLs (1000+) that have been stored for over a year now. I want to run through and verify them all to see whether they still exist. What is the best/quickest way to check them all and return a list of the ones that do not return a site?
2 Answers
11
This is kind of slow, but you can use something like this to check whether a URL is alive:
import urllib2

def url_exists(url):
    try:
        urllib2.urlopen(url)
        return True  # URL exists
    except ValueError:
        return False  # URL is not well formed
    except urllib2.URLError:
        return False  # URL does not seem to be alive
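For readers on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, the same check might look roughly like the sketch below. The function name `url_is_alive` and the 5-second default timeout are assumptions for illustration, not part of the original answer. Passing a timeout matters with 1000+ URLs, because a single unresponsive host can otherwise stall the whole run.

```python
import http.client
import urllib.error
import urllib.request


def url_is_alive(url, timeout=5.0):
    """Return True if the URL can be opened without an error (hypothetical helper)."""
    try:
        # The context manager closes the connection as soon as the check is done.
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except ValueError:
        return False  # URL is not well formed
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # DNS failure, refused connection, timeout, HTTP error status,
        # or a malformed HTTP response.
        return False
```

Note that `urlopen` raises `HTTPError` for 4xx and 5xx responses, so in this sketch a page that answers with 404 counts as not alive.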
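A quicker option than urllib2 is httplib:
import httplib
import socket
try:
    a = httplib.HTTPConnection('google.com')
    a.connect()
except (httplib.HTTPException, socket.error) as ex:
    # connect() reports refused connections and DNS failures as socket.error
    print "not connected"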
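Connecting only shows that something is listening on the port. A sketch that goes one step further and sends a `HEAD` request, so that the server's status code comes back without downloading the page body, is shown here for Python 3, where `httplib` is named `http.client`. The helper name `head_status` and the default timeout are assumptions for illustration.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=5.0):
    """Return the status code of a HEAD request, or None if it fails (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # not an HTTP(S) URL
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return None
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = conn_class(parts.hostname, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (http.client.HTTPException, OSError):
        return None  # DNS failure, refused connection, timeout, TLS error, ...
    finally:
        conn.close()
```

This sketch does not follow redirects, so a 301 or 302 is returned as-is, and some servers do not handle `HEAD` correctly, so falling back to a `GET` for unexpected status codes may be worthwhile.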
You can also do a DNS lookup (though it's not a very reliable way to check whether a website no longer exists):
import socket
try:
    socket.gethostbyname('www.google.com')
except socket.gaierror as ex:
    print "does not exist"
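Whichever check is used, most of the time for 1000+ URLs is spent waiting on the network, so running the checks concurrently is usually the biggest speed-up. Below is a minimal Python 3 sketch using a thread pool from the standard library. The name `find_dead_urls`, the `is_alive` parameter (any callable that takes a URL and returns True or False), and the worker count of 20 are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def find_dead_urls(urls, is_alive, max_workers=20):
    """Return the URLs for which is_alive(url) is false, in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs the checks concurrently but yields results in input order.
        results = list(pool.map(is_alive, urls))
    return [url for url, alive in zip(urls, results) if not alive]


if __name__ == "__main__":
    # Stand-in checker so the example runs without network access;
    # in practice, pass a real check such as one of the functions above.
    sample = ["http://ok.example/", "http://gone.example/"]
    print(find_dead_urls(sample, lambda url: "ok" in url))
```

The checker should catch its own exceptions and use a timeout; otherwise one failing URL aborts the whole `map()`, or one hung connection ties up a worker thread.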

mouad
- Is using socket faster than urllib2? I tried urllib2 but it took forever, so I ended up stopping it. – John Oct 28 '10 at 15:31
- I just edited my answer and added a quicker solution using httplib. Using ping (the other answer) or a DNS lookup (the third solution in my answer) is not very reliable, because many websites are still registered in DNS even though they no longer exist. Ping is just a DNS lookup plus an ICMP ping, which also doesn't tell you whether the website (the HTTP server) is running and accepting connections. – mouad Oct 28 '10 at 17:07
- The `urllib2` one worked for me from behind a proxy on OS X. `httplib` would not work. – Kyle Falconer Jun 06 '16 at 22:45
0
Check this:
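And then:
import ping, socket
try:
    # Pass a host name rather than a full URL: ping works at the host level
    # (assumption about the third-party ping module's do_one).
    result = ping.do_one('stackoverflow.com', timeout=2)
except socket.error, e:
    # host cannot be reached
    print "Error:", e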

Klark
- I have over 1000 URLs to check. Will this be faster than using the urllib2 answer below? – John Oct 28 '10 at 15:30
- I think it will. Test it. It also depends on the network. In any case, it will take some time for the server to respond (you can set a timeout in my solution, as you can see in the code). – Klark Oct 28 '10 at 15:42