
I'm trying to check a webpage's status with Python. I've timed each approach, but none is significantly faster than the others; the worst and best differ by only 20%. I really just need the response code, not the HTML source. There are three response codes I will handle: 200, 403, and 404.
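
For reference, a sketch of one way averages like these can be collected (the URL, method, and repeat count here are just placeholders, not my exact harness):

import timeit

# Time one status check and average it over several runs.
setup = "import urllib"
stmt = 'urllib.urlopen("http://www.stackoverflow.com").getcode()'

runs = 10
total = timeit.timeit(stmt, setup=setup, number=runs)
print total / runs  # average seconds per check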

Method 1 is mine, but the others were found here: Checking if a website is up via Python

Method 1: Right now, I'm using mechanize to open the URL inside a try/except. If the response is a 200, it goes through fine; if it's a 403/404, it falls into the except. This works, but it's not very fast. The average speed is 0.00276.
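
A minimal sketch of roughly what this looks like (the URL and exception-handling details are illustrative, not my exact code):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip the robots.txt fetch for this sketch
try:
    br.open("http://www.stackoverflow.com")  # succeeds on 200
    status = 200
except mechanize.HTTPError as e:
    status = e.code  # 403 or 404
print status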

Method 2: Using urllib, I get about the same time as Method 1. The average speed is 0.00227. Here's the code for that; it's essentially a one-liner.

import urllib

print urllib.urlopen("http://www.stackoverflow.com").getcode()

Method 3: I think this httplib method would be fastest, but it only checks domains rather than individual pages of a domain, so it didn't work in my case. The code for that is:

import httplib

conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD", "/")  # HEAD: only headers come back, no body
r1 = conn.getresponse()
print r1.status, r1.reason

Method 4: This method uses requests.head and it has an average speed of 0.00246. The code is:

r = requests.head("http://www.stackoverflow.com")
print r.status_code  # e.g. 200, 403, or 404
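
Since requests.head() doesn't raise for HTTP error codes like 403/404, the three codes I care about can be handled straight from status_code; a sketch (the branches are just placeholders):

import requests

r = requests.head("http://www.stackoverflow.com")
if r.status_code == 200:
    print "up"
elif r.status_code in (403, 404):
    print "forbidden or missing:", r.status_code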

Does anyone know a more efficient way of checking a webpage's status in Python?

  • What do you mean by speed, as in "it has an average speed of 0.00246"? Is that how long it takes to fetch the page? What are the units? – stephenbez Jan 09 '14 at 22:18
  • How is 2 milliseconds 'not very fast'? – yuvi Jan 09 '14 at 22:18
  • What is your objective? Any gain you get in one method over another is going to be trivial compared to network time. – norlesh Jan 09 '14 at 22:20
  • Every method sends the same verb (HEAD) over HTTP, and since that basically just means opening a socket and sending HEAD / (very few bytes) to get very few bytes back, the execution time essentially depends on your network latency. I doubt you can improve anything on the Python side. – Raphaël Braud Jan 09 '14 at 22:23
  • FWIW, you can certainly pass any path you like in the request in #3 as well. But +1 to @RaphaelBraud's comment. – tripleee Jan 09 '14 at 22:26
  • Thanks to those who answered the question above. +1. For the others trying to shift my interest to network time, my server's download speed is 1 Gbps. – User Jan 09 '14 at 22:38
  • Download speed is only part of the equation. The real problem is latency. Even with a local DNS cache and an HTTP proxy, you cannot reduce the number of TCP round-trips (and these techniques might actually mask the problems you presumably want to detect); you probably cannot improve what you already have to the point where the difference would be significant (i.e. worth the effort). If you need to do this on a massive scale, run several jobs in parallel. If not, why worry about the speed at all? – tripleee Jan 10 '14 at 07:38
  • I do it on a scale of 200k URLs a day, but I want to increase that tenfold. It turned out the big delay is actually MySQL, not the page request. – User Jan 10 '14 at 14:03
  • What do you mean by fastest? Whatever you use will basically open a socket connection to the server, so it doesn't matter much when you're only retrieving the response code. – Abhishek Jan 13 '14 at 08:44

1 Answer


The three libraries you've mentioned (mechanize, urllib, and httplib) pretty well cover the immediate options, and Requests, your Method 4, rounds them out.

Note that mechanize wraps urllib2, while requests makes use of urllib3.
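
For completeness, dropping down to urllib2 directly looks roughly like this (a sketch only; as noted below, it's unlikely to be meaningfully faster than what you've already tried):

import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 has no built-in HEAD support, so override the method name.
    def get_method(self):
        return "HEAD"

try:
    response = urllib2.urlopen(HeadRequest("http://www.stackoverflow.com"))
    print response.getcode()
except urllib2.HTTPError as e:
    print e.code  # 403 or 404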

As the comments on the question point out, these are all mature libraries, so it's unlikely you'll find performance improvements in other libraries or by re-implementing the request yourself.

Still, if that's your goal then that's probably the direction to head.

Dwight Gunning