6

For those who know wget, it has an option, --spider, which lets one check whether a link is broken without actually downloading the webpage. I would like to do the same thing in Python. My problem is that I have a list of 100,000 links I want to check, at most once a day and at least once a week. Either way, downloading the pages would generate a lot of unnecessary traffic.

As far as I understand from the urllib2.urlopen() documentation, it does not download the page but only the meta-information. Is this correct? Or is there some other way to do this in a nice manner?

Best,
Troels

SilentGhost
Troels

2 Answers

9

You should use a HEAD request for this: it asks the web server for the headers without the body. See How do you send a HEAD HTTP request in Python 2?
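
A minimal sketch of such a HEAD check, assuming Python 3's urllib.request rather than the Python 2 urllib2 module mentioned in the question. The function name check_link and the 10-second timeout are illustrative choices, not part of any library. Note that urlopen follows redirects, so the returned status is that of the final response.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10):
    """Return the HTTP status code for `url` via a HEAD request.

    Returns None when the URL is malformed or the server cannot be reached.
    `check_link` is a hypothetical helper name used for illustration.
    """
    try:
        # method="HEAD" asks the server for headers only, no body.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or an invalid URL.
        return None
```

Calling check_link on each URL in the list gives a status code (or None) per link without transferring the page bodies.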

Jochen Ritzel
  • Right, HEAD will get you the headers (including HTTP status) without downloading the body of the message. Some sites are (mis)configured to send 'not found'/404 pages with a status of 200, though, so it would be hard to detect those situations. – JAL Jul 12 '10 at 15:26
  • As far as I can tell this is what wget --spider does. – Kathy Van Stone Jul 12 '10 at 16:06
  • Thanks a lot for the solution as well as the thoughts on misconfigured sites (that is worth keeping in mind!) - that is just what I need :) – Troels Jul 12 '10 at 19:33
  • Will this also work for URLs to files like zip or png? – Michał Kawiecki May 11 '18 at 14:24
-1

I'm not sure how to do this in Python, but in general you could read the response headers and check the status code for 200. At that point you can stop reading and move on to the next link, so you download only the response headers rather than the whole page. See the List of Status Codes.
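
One way this idea might look in Python 3, sketched with the standard http.client module: send a normal GET, read only the status line and headers, then close the connection without reading the body. The function name status_without_body is hypothetical. Unlike HEAD, the server may still begin sending the body before the connection closes, so this saves less traffic.

```python
import http.client
from urllib.parse import urlsplit


def status_without_body(url, timeout=10):
    """Send a GET, read the status line and headers, then close.

    `status_without_body` is a hypothetical helper name. The response
    body is never read by this function.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)

    # Rebuild the request target from the path and optional query string.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    try:
        conn.request("GET", path)
        # getresponse() parses the status line and headers only.
        response = conn.getresponse()
        return response.status
    finally:
        # Closing here abandons the body instead of downloading it.
        conn.close()
```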

Greg
  • Why has this been downvoted? Please explain your reasoning. I know that this does not use the HEAD request, but it accomplishes the same thing. – Greg Jul 12 '10 at 16:57
  • 301 is a redirect and is a good response as well. Actually, any 2** is OK, 3** need further processing (redirect), etc. Checking only for 200 is insufficient. – kgadek Jan 10 '14 at 16:12
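
The point in the last comment can be sketched as a small classifier over status codes. The function name classify_status and the category labels are illustrative assumptions; how to treat redirects or 1xx codes is a policy choice for the link checker, not something the thread settles.

```python
def classify_status(status):
    """Map an HTTP status code to a coarse link-health category.

    `classify_status` and its labels are hypothetical names for
    illustration. Pass None for a link that could not be reached.
    """
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"        # any 2xx counts as success, not just 200
    if 300 <= status < 400:
        return "redirect"  # needs further processing: follow the new location
    if 400 <= status < 600:
        return "broken"    # client or server error
    return "unknown"       # 1xx or out-of-range codes
```

For example, classify_status(301) yields "redirect" rather than treating the link as broken, which a check for exactly 200 would do.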