9

I'm trying to create a basic link checker in Python.

When using the following code:

def get_link_response_code(link_to_check):  
    resp = requests.get(link_to_check)
    return resp.status_code

I always get the right response code, but it takes a considerable amount of time.

But when using this code (requests.get replaced with requests.head):

def get_link_response_code(link_to_check):  
    resp = requests.head(link_to_check)
    return resp.status_code

It usually works, and is very fast, but it sometimes returns HTTP 405 (for a link that is not really broken).

Why am I getting 405 (wrong method) errors? What can I do to quickly check for broken links? Thanks.

tomermes
  • This [link](http://stackoverflow.com/questions/16539269/http-head-vs-get-perfomances) would be useful – Hana Bzh Jan 04 '15 at 07:41
  • It looks like one of the proxies/servers on the "current" route to that (valid!) resource is configured not to accept the `HEAD` method. Nothing to do with the code itself... – Ron Klein Jan 04 '15 at 08:49

3 Answers

9

According to the specification, 405 means "Method Not Allowed", which here means that you cannot use HEAD for this particular resource.

Handle it and use get() in these cases:

def get_link_response_code(link_to_check):
    resp = requests.head(link_to_check)
    if resp.status_code == 405:
        resp = requests.get(link_to_check)
    return resp.status_code
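
The same fallback can be sketched with a few extra options that tend to matter for a link checker. This is only an illustration under assumptions not in the original answer: the function takes a `session` argument (any object with requests-style `head()` and `get()` methods, such as a `requests.Session`), the 5 second timeout is an arbitrary value, and the GET retry uses `stream=True` so that the body is not downloaded.

```python
def get_link_response_code(session, link_to_check, timeout=5.0):
    """Return the HTTP status code for a link, trying the cheaper HEAD first.

    `session` is assumed to be a requests.Session (or any object exposing
    compatible head() and get() methods). The timeout value is arbitrary.
    """
    # requests does not follow redirects on HEAD unless asked to.
    resp = session.head(link_to_check, allow_redirects=True, timeout=timeout)
    if resp.status_code == 405:
        # HEAD was rejected: retry with GET, streaming so the body is not fetched.
        resp = session.get(link_to_check, stream=True, timeout=timeout)
        resp.close()  # release the connection without reading the body
    return resp.status_code
```

With requests installed, it could be called as `get_link_response_code(requests.Session(), "http://example.com/")`; reusing one session across many links also lets connections to the same host be reused.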

As a side note, you may not need to make an additional get(), since 405 is kind of a "good" error: the resource exists, but it is not available via HEAD. You may also check the value of the Allow response header, which must be set in the response to your HEAD request:

The Allow entity-header field lists the set of methods supported by the resource identified by the Request-URI. The purpose of this field is strictly to inform the recipient of valid methods associated with the resource. An Allow header field MUST be present in a 405 (Method Not Allowed) response.
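
As a rough sketch of how the Allow header could be inspected, the helper below splits a header value into method names. The name `parse_allow_header` is hypothetical (it is not part of requests), and in practice some servers omit the header despite the specification, so an empty result should be expected.

```python
def parse_allow_header(value):
    """Split an Allow header value such as "GET, POST" into a set of method names.

    Hypothetical helper for illustration; returns an empty set for an empty value.
    """
    return {part.strip().upper() for part in value.split(",") if part.strip()}
```

Given a 405 response `resp` from requests, `"GET" in parse_allow_header(resp.headers.get("Allow", ""))` would indicate whether the server advertises GET for that resource.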

alecxe
  • As a side note, configuring servers to disable the HEAD method is generally bad practice. – Anzel Jan 04 '15 at 07:37
  • Great answer, thanks. I will use your change to the code, but your side note is wrong: I tried going to another page on this "blocking" website, e.g. www.domain-with-405.com/non-existent/, and in the browser I get a 404 error, but from the code I still get 405. So if I want to check whether a specific page exists, I must use the get function in those cases. Thanks again. – tomermes Jan 04 '15 at 10:56
2

With requests.get you get the information correctly because the GET method means retrieving whatever information (in the form of an entity) is identified by the Request-URI, whereas with requests.head the server does not return a message body in the response.

Please note that the HEAD method is identical to GET except that the server MUST NOT return a message-body in the response.

Seroney
0

If you are trying to crawl a web page, your request will probably use the GET method, and it should return 200 if everything is OK. However, some server configurations may not allow the GET method from a program for some reason. You can add some code like this:

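import requests

def get_link_response_code(link_to_check):
    try:
        resp = requests.head(link_to_check)
        if resp.status_code != 200:
            # Any non-200 status is reported as an error
            print("error")
        else:
            return resp.status_code
    except requests.RequestException as error:
        # Connection failures, timeouts, invalid URLs, etc.
        print(error)

    return None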

Hope that helps!

lqhcpsgbl