118

What I'm trying to do here is get the headers of a given URL so I can determine the MIME type. I want to be able to see if http://somedomain/foo/ will return an HTML document or a JPEG image for example. Thus, I need to figure out how to send a HEAD request so that I can read the MIME type without having to download the content. Does anyone know of an easy way of doing this?

fuentesjr

11 Answers

109

urllib2 can be used to perform a HEAD request. This is a little nicer than using httplib since urllib2 parses the URL for you instead of requiring you to split the URL into host name and path.

>>> import urllib2
>>> class HeadRequest(urllib2.Request):
...     def get_method(self):
...         return "HEAD"
... 
>>> response = urllib2.urlopen(HeadRequest("http://google.com/index.html"))

Headers are available via response.info() as before. Interestingly, you can find the URL that you were redirected to:

>>> print response.geturl()
http://www.google.com.au/index.html
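
Since the goal is to read the MIME type, a minimal sketch of pulling it out of those headers might look like this (the values shown are illustrative; in Python 2, response.info() returns a mimetools.Message-like object):

>>> response.info().gettype()
'text/html'
>>> response.info().getheader('Content-Type')
'text/html; charset=ISO-8859-1'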
doshea
  • response.info().__str__() will return the string format of the headers, in case you want to do something with the result you get. – Shane Oct 12 '10 at 12:17
  • Except that when trying this with Python 2.7.1 (Ubuntu Natty), if there's a redirect, it does a GET on the destination, not a HEAD... – eichin Aug 23 '11 at 04:37
  • That's the advantage of `httplib.HTTPConnection`, which doesn't handle redirects automatically. – Ehtesh Choudhury Oct 04 '11 at 06:59
  • But with doshea's answer, how do you set the timeout? And how do you handle bad URLs, i.e., URLs that are no longer alive? – fanchyna Aug 19 '13 at 17:31
105

Edit: This answer works, but nowadays you should just use the requests library, as mentioned in other answers below.

Use httplib.

>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]

There's also a getheader(name) to get a specific header.
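
For example, a quick sketch of pulling just the MIME type (the exact value depends on the server):

>>> res.getheader('content-type')
'text/html; charset=ISO-8859-1'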

Eevee
75

Obligatory Requests way:

import requests

resp = requests.head("http://www.google.com")
print resp.status_code, resp.text, resp.headers
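
Note that a HEAD response has no body, so `resp.text` will normally be empty; the MIME type lives in the headers. A minimal sketch of reading it (requests treats header names case-insensitively):

import requests

resp = requests.head("http://www.google.com")
print resp.headers.get('content-type')  # e.g. 'text/html; charset=ISO-8859-1'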
K Z
36

I believe the Requests library should be mentioned as well.
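
A minimal sketch of that approach, using `allow_redirects` (whose behavior across versions is discussed in the comments below) to stop at the first response:

import requests

r = requests.head('http://github.com', allow_redirects=False)
print r.status_code, r.headers.get('location')  # e.g. 301 and the redirect target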

daliusd
  • This answer deserves more attention. Looks like a pretty good library that makes the problem trivial. – Nick Retallack Oct 27 '11 at 00:00
  • I agree. It was very simple to make the request: `import requests; r = requests.head('http://github.com')` – Luis R. Nov 17 '11 at 19:45
  • @LuisR.: if there is a redirect, then it follows it with GET/POST/PUT/DELETE as well. – jfs Feb 10 '12 at 13:40
  • @Nick Retallack: there is no easy way to disable redirects. `allow_redirects` can disable only POST/PUT/DELETE redirects. Example: [head request no redirect](http://hastebin.com/hokutehopu.py) – jfs Feb 10 '12 at 14:01
  • @J.F.Sebastian The link to your example seems to be broken. Could you elaborate on the issue with following redirects? – Piotr Dobrogost Aug 30 '12 at 18:13
  • @Piotr: The issue was that `requests.head(URL)` didn't stop on a redirect and made additional GET requests. The current version, 0.13.9, doesn't do that anymore (at least for 301 and 302 redirects). – jfs Aug 30 '12 at 20:32
  • @J.F.Sebastian It seems it was fixed in revision [6f57352](https://github.com/kennethreitz/requests/commit/6f5735274b9ce2c61345adf8d7657b01b1623320) – Piotr Dobrogost Aug 31 '12 at 07:17
  • @Piotr: something else also changed. As I said above, `allow_redirects` worked by enabling redirects for POST, i.e., it had no effect on HEAD. – jfs Aug 31 '12 at 13:47
  • See http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests for a comparison of the different libraries that could be used for this. Requests seems to be the most popular. – brita_ Apr 29 '14 at 14:08
17

Just:

import urllib2
request = urllib2.Request('http://localhost:8080')
request.get_method = lambda: 'HEAD'

response = urllib2.urlopen(request)
response.info().gettype()

Edit: I've just come to realize there is httplib2 :D

import httplib2
h = httplib2.Http()
# request() returns a (response headers, content) tuple; header values are strings
resp, content = h.request("http://www.google.com", 'HEAD')
assert resp['status'] == '200'
assert resp['content-type'].startswith('text/html')
...


Paweł Prażak
  • Slightly nasty in that you're leaving get_method as an unbound function rather than binding it to `request`. (Viz, it'll work, but it's bad style, and if you wanted to use `self` in it - tough.) – Chris Morgan Dec 12 '10 at 12:53
  • Could you elaborate a bit more on the pros and cons of this solution? I'm not a Python expert, as you can see, so I could benefit from knowing when it can turn bad ;) As far as I understand, the concern is that it's a hack that may or may not work depending on implementation changes? – Paweł Prażak Dec 12 '10 at 13:54
  • The second version of this code is the only one that worked for me for a URL with a 403 Forbidden. Others were throwing an exception. – duality_ Apr 11 '13 at 15:16
12

For completeness, here is a Python 3 answer equivalent to the accepted answer using httplib.

It is basically the same code, just that the library isn't called httplib anymore but http.client:

from http.client import HTTPConnection

conn = HTTPConnection('www.google.com')
conn.request('HEAD', '/index.html')
res = conn.getresponse()

print(res.status, res.reason)
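
To get at the MIME type, http.client.HTTPResponse also provides getheader(); a quick sketch (the value shown is illustrative):

print(res.getheader('Content-Type'))  # e.g. 'text/html; charset=ISO-8859-1'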
Octavian Helm
2
import httplib
import urlparse

def unshorten_url(url):
    """Follow a single redirect via a HEAD request and return the target URL."""
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')  # fall back to '/' for bare domains
    response = h.getresponse()
    # Any 3xx status with a Location header is a redirect.
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
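
Usage might look like this (the short URL and its expansion are hypothetical):

>>> unshorten_url('http://bit.ly/abc123')  # hypothetical short link
'http://example.com/target-page'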
jcomeau_ictx
  • +1 for the `urlparse` - together with `httplib` they give the comfort of `urllib2`, when dealing with URLs on the input side. – Tomasz Gandor Jan 10 '13 at 10:47
1

I have found that httplib is slightly faster than urllib2. I timed two programs - one using httplib and the other using urllib2 - sending HEAD requests to 10,000 URLs. The httplib one was faster by several minutes. httplib's total stats were: real 6m21.334s, user 0m2.124s, sys 0m16.372s.

And urllib2's total stats were: real 9m1.380s, user 0m16.666s, sys 0m28.565s.

Does anybody else have input on this?

IgorGanapolsky
  • Input? The problem is IO-bound and you're using blocking libraries. Switch to eventlet or twisted if you want better performance. The limitations of urllib2 you mention are CPU-bound. – Devin Jeanpierre Aug 13 '10 at 01:04
  • urllib2 follows redirects, so if some of your URLs redirect, that will probably be the reason for the difference. And httplib is more low-level; urllib2 does parse the URL, for example. – Marian Aug 25 '10 at 22:05
  • urllib2 is just a thin layer of abstraction on top of httplib; I'd be very surprised if you were CPU-bound unless the URLs are on a very fast LAN. Is it possible some of the URLs were redirects? urllib2 will follow the redirects whereas httplib would not. The other possibility is that the network conditions (anything you don't have explicit control of in this experiment) fluctuated between the two runs. You should do at least 3 interleaved runs of each to reduce this likelihood. – John La Rooy Feb 20 '11 at 20:30
1

As an aside, when using httplib (at least on 2.5.2), trying to read the response of a HEAD request will block (on readline) and subsequently fail. If you do not issue a read on the response, you are unable to send another request on the connection; you will need to open a new one, or accept a long delay between requests.
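
A sketch of working around that constraint by opening a fresh connection per HEAD request (the hostnames are illustrative):

import httplib

for host in ['www.google.com', 'www.example.com']:  # illustrative hosts
    conn = httplib.HTTPConnection(host)
    conn.request('HEAD', '/')
    res = conn.getresponse()
    print host, res.status, res.getheader('content-type')
    conn.close()  # don't reuse the connection; see the caveat above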

0

And yet another approach (similar to Paweł's answer):

import urllib2
import types

request = urllib2.Request('http://localhost:8080')
request.get_method = types.MethodType(lambda self: 'HEAD', request, request.__class__)

Just to avoid having unbound functions at the instance level.

estani
-4

Probably easier: use urllib or urllib2.

>>> import urllib
>>> f = urllib.urlopen('http://google.com')
>>> f.info().gettype()
'text/html'

f.info() is a dictionary-like object, so you can do f.info()['content-type'], etc.

http://docs.python.org/library/urllib.html
http://docs.python.org/library/urllib2.html
http://docs.python.org/library/httplib.html

The docs note that httplib is not normally used directly.

  • However, urllib will do a GET, and the question is about performing a HEAD. Maybe the poster does not want to retrieve an expensive document. – Philippe F May 06 '09 at 08:30