15

I have a Python web client that uses urllib2. It is easy enough to add HTTP headers to my outgoing requests. I just create a dictionary of the headers I want to add, and pass it to the Request initializer.

However, other "standard" HTTP headers get added to the request alongside the custom ones I explicitly add. When I sniff the request using Wireshark, I see headers besides the ones I add myself. My question is: how do I get access to these headers? I want to log every request (including the full set of HTTP headers), and can't figure out how.

Any pointers?

In a nutshell: How do I get all the outgoing headers from an HTTP request created by urllib2?

Corey Goldberg

8 Answers

10

If you want to see the literal HTTP request that is sent out, and therefore see every last header exactly as it is represented on the wire, then you can tell urllib2 to use your own version of an HTTPHandler that prints out (or saves, or whatever) the outgoing HTTP request.

import httplib, urllib2

class MyHTTPConnection(httplib.HTTPConnection):
    def send(self, s):
        print s  # or save them, or whatever!
        httplib.HTTPConnection.send(self, s)

class MyHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        return self.do_open(MyHTTPConnection, req)

opener = urllib2.build_opener(MyHTTPHandler)
response = opener.open('http://www.google.com/')

The result of running this code is:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.google.com
Connection: close
User-Agent: Python-urllib/2.6
Raj
Brandon Rhodes
  • If connecting via SSL, use `urllib2.HTTPSHandler` (`https_open()`) and `httplib.HTTPSConnection` instead. – Haw-Bin Nov 30 '11 at 20:34
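Following that comment, a minimal sketch of the HTTPS variant might look like the code below. The class names LoggingHTTPSConnection and LoggingHTTPSHandler are made up for illustration, and the try/except import is only there so the snippet also runs on Python 3, where httplib and urllib2 were renamed. In Python versions where HTTPSHandler accepts a context argument, this simple override does not forward it.

```python
from __future__ import print_function

try:  # Python 2, as used in this thread
    import httplib
    import urllib2
except ImportError:  # Python 3 renamed both modules
    import http.client as httplib
    import urllib.request as urllib2


class LoggingHTTPSConnection(httplib.HTTPSConnection):
    """Hypothetical name: prints every chunk before it is sent."""

    def send(self, s):
        print(s)  # or save it, or hand it to a logger
        httplib.HTTPSConnection.send(self, s)


class LoggingHTTPSHandler(urllib2.HTTPSHandler):
    """Hypothetical name: makes urllib2 use the logging connection."""

    def https_open(self, req):
        return self.do_open(LoggingHTTPSConnection, req)


# build_opener() swaps the default HTTPSHandler for our subclass.
opener = urllib2.build_opener(LoggingHTTPSHandler)
# response = opener.open('https://www.google.com/')  # needs network access
```

To cover both schemes, the plain HTTP handler from the answer and this one can be passed to build_opener() together.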
5

The urllib2 library uses OpenerDirector objects to handle the actual opening. Fortunately, the Python library provides a default one, so you don't have to build your own. It is, however, these OpenerDirector objects that add the extra headers.

To see what they are after the request has been sent (so that you can log it, for example):

req = urllib2.Request(url='http://google.com')
response = urllib2.urlopen(req)
print req.unredirected_hdrs

(produces {'Host': 'google.com', 'User-agent': 'Python-urllib/2.5'} etc)

The unredirected_hdrs is where the OpenerDirectors dump their extra headers. Simply looking at req.headers will show only your own headers - the library leaves those unmolested for you.

If you need to see the headers before you send the request, you'll need to subclass the OpenerDirector in order to intercept the transmission.

Hope that helps.

EDIT: I forgot to mention that, once the request has been sent, req.header_items() will give you a list of tuples of ALL the headers, with both your own and the ones added by the OpenerDirector. I should have mentioned this first since it's the most straightforward :-) Sorry.
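To see the difference between the three views without touching the network, here is a rough sketch that runs only the HTTP handler's http_request pre-processing hook on a request. The URL and the X-Request-Id header are placeholders, and the try/except import is an assumption to let it run on Python 3 as well. As the comments below point out, headers added later (such as Connection: close) will not show up this way.

```python
from __future__ import print_function

try:  # Python 2, as used in this thread
    import urllib2
except ImportError:  # Python 3 renamed the module
    import urllib.request as urllib2

# Placeholder URL and custom header, purely for illustration.
req = urllib2.Request('http://www.example.com/', None,
                      {'X-Request-Id': 'abc123'})
opener = urllib2.build_opener()

# Run only the HTTP handler's pre-processing step; nothing is sent.
for handler in opener.handlers:
    if isinstance(handler, urllib2.HTTPHandler):
        req = handler.http_request(req)

print(req.headers)               # only the headers passed to Request()
print(req.unredirected_hdrs)     # what the library added (Host, User-agent)
print(sorted(req.header_items()))  # both sets merged
```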

EDIT 2: After your question about an example for defining your own handler, here's the sample I came up with. The concern in any monkeying with the request chain is that we need to be sure that the handler is safe for multiple requests, which is why I'm uncomfortable just replacing the definition of putheader on the HTTPConnection class directly.

Sadly, because the internals of HTTPConnection and the AbstractHTTPHandler are very internal, we have to reproduce much of the code from the Python library to inject our custom behaviour. Assuming I've not goofed below and this works as well as it did in my 5 minutes of testing, please be careful to revisit this override whenever you update your Python version, even by a point release (i.e. 2.5.x to 2.5.y, or 2.5 to 2.6, etc.).

I should therefore mention that I am on Python 2.5.1. If you have 2.6 or, particularly, 3.0, you may need to adjust this accordingly.

Please let me know if this doesn't work. I'm having waaaayyyy too much fun with this question:

import urllib2
import httplib
import socket


class CustomHTTPConnection(httplib.HTTPConnection):

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.stored_headers = []

    def putheader(self, header, value):
        self.stored_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)


class HTTPCaptureHeaderHandler(urllib2.AbstractHTTPHandler):

    def http_open(self, req):
        return self.do_open(CustomHTTPConnection, req)

    http_request = urllib2.AbstractHTTPHandler.do_request_

    def do_open(self, http_class, req):
        # All code here lifted directly from the python library
        host = req.get_host()
        if not host:
            raise urllib2.URLError('no host given')

        h = http_class(host) # will parse host:port
        h.set_debuglevel(self._debuglevel)

        headers = dict(req.headers)
        headers.update(req.unredirected_hdrs)
        headers["Connection"] = "close"
        headers = dict(
            (name.title(), val) for name, val in headers.items())
        try:
            h.request(req.get_method(), req.get_selector(), req.data, headers)
            r = h.getresponse()
        except socket.error, err: # XXX what error?
            raise urllib2.URLError(err)
        r.recv = r.read
        fp = socket._fileobject(r, close=True)

        resp = urllib2.addinfourl(fp, r.msg, req.get_full_url())
        resp.code = r.status
        resp.msg = r.reason

        # This is the line we're adding
        req.all_sent_headers = h.stored_headers
        return resp

my_handler = HTTPCaptureHeaderHandler()
opener = urllib2.OpenerDirector()
opener.add_handler(my_handler)
req = urllib2.Request(url='http://www.google.com')

resp = opener.open(req)

print req.all_sent_headers

shows: [('Accept-Encoding', 'identity'), ('Host', 'www.google.com'), ('Connection', 'close'), ('User-Agent', 'Python-urllib/2.5')]
Jarret Hardie
  • this is very helpful. however, I am not seeing *all* the headers still (like, Connection: Close) – Corey Goldberg Mar 02 '09 at 21:32
  • Hmmm.... would you mind posting how you're constructing the Request and how you're opening the connection please? – Jarret Hardie Mar 02 '09 at 21:34
  • opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie_jar)) request = urllib2.Request(url, None, headers) – Corey Goldberg Mar 02 '09 at 21:38
  • I don't think req.header_items() will include headers sent by the underlying HTTPConnection. – Justus Mar 02 '09 at 21:40
  • Justus is right. Specifically with 'Connection: Close'... the Opener has a method called 'do_open' where that gets added. It's added by a local variable in that function, which constructs a totally separate request object; that request object is thrown away at the end of the function scope – Jarret Hardie Mar 02 '09 at 21:42
  • In this case, I'm afraid you will have to write your own opener, since that default do_open function does its own socket initialization, so there's very little opportunity for you to inject anything, or observe anything via subclassing – Jarret Hardie Mar 02 '09 at 21:43
  • For inspiration, have a look at urllib2.py. You're looking for the AbstractHTTPHandler.do_open() function. Sorry I can't think of a more magic way to get that level of info – Jarret Hardie Mar 02 '09 at 21:44
  • OK... I should think before I comment :-) It's the handler, not the opener. You can use the build_opener() function you are already using, but provide your own Handler, rather than having to write an opener... this is not something I've done in some time. – Jarret Hardie Mar 02 '09 at 21:47
  • Jarret, see the JUSTUS answer, he almost nailed it, but it keeps appending new headers if I call it over and over. any ideas to add to his solution? – Corey Goldberg Mar 02 '09 at 21:53
  • Jarret, any tip on providing my own handler that will do this. getting over my head :) – Corey Goldberg Mar 02 '09 at 22:02
  • Happy to help... this is a good question. I've updated my answer above. – Jarret Hardie Mar 02 '09 at 22:40
  • wow.. I was hoping it would be something easy. I'll give your approach a try (holy cow!). I think your code is too much to add to my codebase though :) This feature would be a great addition to urllib2 or httplib. – Corey Goldberg Mar 02 '09 at 22:52
  • Heh heh... yeah. I suppose you could look into using 'curl' from the shell if that's appropriate for your deployment env, or doing something much lower level with sockets... you'd lose a lot of sanity checking, but it would be shorter. – Jarret Hardie Mar 03 '09 at 00:11
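If printing to stdout is good enough, a much shorter route may be the debuglevel argument that HTTPHandler already accepts; it ends up in the same set_debuglevel() call that appears in do_open. This is only a sketch: with a debug level above zero, httplib prints each outgoing chunk prefixed with "send:", but it goes to stdout rather than into a variable, so it may not suit in-program logging. The try/except import is an assumption so the snippet also runs on Python 3.

```python
try:  # Python 2, as used in this thread
    import urllib2
except ImportError:  # Python 3 renamed the module
    import urllib.request as urllib2

# debuglevel is passed down to the underlying connection object.
handler = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(handler)
# opener.open('http://www.example.com/')  # needs network; prints "send: ..."
```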
2

How about something like this:

import urllib2
import httplib

old_putheader = httplib.HTTPConnection.putheader
def putheader(self, header, value):
    print header, value
    old_putheader(self, header, value)
httplib.HTTPConnection.putheader = putheader

urllib2.urlopen('http://www.google.com')
Justus
  • this is VERY close to what I need. the only prob is when I call it in a loop, it keeps appending repeat headers. – Corey Goldberg Mar 02 '09 at 21:30
  • JUSTUS, this is so close.. can you update your answer if you have any other thoughts? – Corey Goldberg Mar 02 '09 at 22:01
  • I don't understand what you mean by "in a loop". But, given that this requires so much hackery I wonder why you need so much logging. You might be better off using an http proxy, have that do all the logging, and use urllib to talk to it. – Justus Mar 02 '09 at 22:40
  • well.. i have a load testing tool that sends HTTP requests repeatedly. It has a logging/debug mode where I would like to log the full HTTP requests and responses... including headers. – Corey Goldberg Mar 02 '09 at 22:46
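One plausible cause of the repeated headers (a guess, since the looping code was not posted) is that the patch gets applied again on every iteration, so the wrappers stack up. A sketch of a scoped version follows: a context manager, here called captured_headers (a made-up name), wraps putheader, collects the headers into a fresh list, and always restores the original afterwards. To stay off the network, the demo only calls putrequest() and putheader(), which buffer data without connecting.

```python
from __future__ import print_function

import contextlib

try:  # Python 2, as used in this thread
    import httplib
except ImportError:  # Python 3 renamed the module
    import http.client as httplib


@contextlib.contextmanager
def captured_headers():
    """Temporarily wrap HTTPConnection.putheader; yield collected pairs."""
    original = httplib.HTTPConnection.putheader
    captured = []

    def putheader(self, header, *values):
        for value in values:
            # Python 3 passes the Host value as bytes; normalise to text.
            if isinstance(value, bytes) and not isinstance(value, str):
                value = value.decode('latin-1')
            captured.append((header, value))
        original(self, header, *values)

    httplib.HTTPConnection.putheader = putheader
    try:
        yield captured
    finally:
        httplib.HTTPConnection.putheader = original  # always undo the patch


for attempt in range(2):
    with captured_headers() as sent:
        conn = httplib.HTTPConnection('www.example.com')  # placeholder host
        conn.putrequest('GET', '/')  # buffers request line and default headers
        conn.putheader('User-Agent', 'load-tester')
    print(attempt, sent)
```

Each pass starts with an empty list, so the headers from one request do not leak into the next. In real use the body of the with block would be the urllib2.urlopen() call.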
2

A low-level solution:

import httplib

class HTTPConnection2(httplib.HTTPConnection):
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self._request_headers = []
        self._request_header = None

    def putheader(self, header, value):
        self._request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self._request_header = s
        httplib.HTTPConnection.send(self, s)

    def getresponse(self, *args, **kwargs):
        response = httplib.HTTPConnection.getresponse(self, *args, **kwargs)
        response.request_headers = self._request_headers
        response.request_header = self._request_header
        return response

Example:

conn = HTTPConnection2("www.python.org")
conn.request("GET", "/index.html", headers={
    "User-agent": "test",
    "Referer": "/",
})
response = conn.getresponse()

response.status, response.reason:

1: 200 OK

response.request_headers:

[('Host', 'www.python.org'), ('Accept-Encoding', 'identity'), ('Referer', '/'), ('User-agent', 'test')]

response.request_header:

GET /index.html HTTP/1.1
Host: www.python.org
Accept-Encoding: identity
Referer: /
User-agent: test
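For logging, the raw text in response.request_header can be turned back into a request line plus header pairs. The helper below, split_request_head, is a hypothetical name and just a sketch; it assumes the captured data is the header block of a single request. Note that the class above overwrites _request_header on every send() call, so if send() is called more than once (for example for a request body), only the last chunk is kept.

```python
from __future__ import print_function


def split_request_head(raw):
    """Split a raw request head into (request_line, [(name, value), ...])."""
    # On Python 3, send() receives bytes; decode them to text first.
    if isinstance(raw, bytes) and not isinstance(raw, str):
        raw = raw.decode('latin-1')
    head = raw.split('\r\n\r\n', 1)[0]  # drop anything after the blank line
    lines = head.split('\r\n')
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers.append((name.strip(), value.strip()))
    return lines[0], headers


# The request shown above, written out with explicit line endings.
raw = ('GET /index.html HTTP/1.1\r\n'
       'Host: www.python.org\r\n'
       'Accept-Encoding: identity\r\n'
       'Referer: /\r\n'
       'User-agent: test\r\n\r\n')
request_line, headers = split_request_head(raw)
print(request_line)
print(headers)
```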
jedie
2

Another solution, which uses the idea from "How do you get default headers in a urllib2 Request?" but doesn't copy code from the standard library:

import httplib
import urllib2


class HTTPConnection2(httplib.HTTPConnection):
    """
    Like httplib.HTTPConnection but stores the request headers.
    Used in HTTPConnection3(), see below.
    """
    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.request_headers = []
        self.request_header = ""

    def putheader(self, header, value):
        self.request_headers.append((header, value))
        httplib.HTTPConnection.putheader(self, header, value)

    def send(self, s):
        self.request_header = s
        httplib.HTTPConnection.send(self, s)


class HTTPConnection3(object):
    """
    Wrapper around HTTPConnection2
    Used in HTTPHandler2(), see below.
    """
    def __call__(self, *args, **kwargs):
        """
        instance made in urllib2.HTTPHandler.do_open()
        """
        self._conn = HTTPConnection2(*args, **kwargs)
        self.request_headers = self._conn.request_headers
        self.request_header = self._conn.request_header
        return self

    def __getattribute__(self, name):
        """
        Redirect attribute access to the local HTTPConnection() instance.
        """
        if name == "_conn":
            return object.__getattribute__(self, name)
        else:
            return getattr(self._conn, name)


class HTTPHandler2(urllib2.HTTPHandler):
    """
    A HTTPHandler which stores the request headers.
    Uses HTTPConnection3, see above.

    >>> opener = urllib2.build_opener(HTTPHandler2)
    >>> opener.addheaders = [("User-agent", "Python test")]
    >>> response = opener.open('http://www.python.org/')

    Get the request headers as a list built with HTTPConnection.putheader():
    >>> response.request_headers
    [('Accept-Encoding', 'identity'), ('Host', 'www.python.org'), ('Connection', 'close'), ('User-Agent', 'Python test')]

    >>> response.request_header
    'GET / HTTP/1.1\\r\\nAccept-Encoding: identity\\r\\nHost: www.python.org\\r\\nConnection: close\\r\\nUser-Agent: Python test\\r\\n\\r\\n'
    """
    def http_open(self, req):
        conn_instance = HTTPConnection3()
        response = self.do_open(conn_instance, req)
        response.request_headers = conn_instance.request_headers
        response.request_header = conn_instance.request_header
        return response

EDIT: Updated the source.

Community
jedie
0

It sounds to me like you're looking for the headers of the response object, which include Connection: close, etc. These headers live in the object returned by urlopen. Getting at them is easy enough:

from urllib2 import urlopen
req = urlopen("http://www.google.com")
print req.headers.headers

req.headers is an instance of httplib.HTTPMessage

dcolish
  • nope.. was looking for request headers, not response headers – Corey Goldberg Oct 27 '10 at 16:52
  • Ah, well then you'll either need to create your own handler for HTTP requests that dumps this as the above examples might do, or if you are open to tweaking the stdlib, just drop in a log line in AbstractHTTPHandler.do_open that dumps headers. – dcolish Oct 27 '10 at 18:14
  • Variable should have been spelled `rep` as it's reply not request and you should use documented `.info()` method instead of undocumented `headers` attribute. – Piotr Dobrogost Apr 10 '11 at 18:34
0

See urllib2.py: do_request (line 1044 (1067)) and do_open (line 1073). At line 293, self.addheaders = [('User-agent', client_version)], so only 'User-agent' is added there.

Mykola Kharechko
-1

It should send the default HTTP headers (as specified by w3.org) alongside the ones you specify. You can use a tool like Wireshark if you would like to see them in their entirety.

Edit:

If you would like to log them, you can use WinPcap to capture packets sent by specific applications (in your case, python). You can also specify the type of packets and many other details.

-John

John T
  • I need to log them from inside my Python program so WinPcap won't help me. thanks though. – Corey Goldberg Mar 02 '09 at 21:01
  • Yes it will, have you even read what it is or how to use it? It's used with the wireshark program itself, which shows you analyzed output of packets and has the ability to log them. – John T Mar 02 '09 at 22:14
  • the packets contain the headers, I thought that was obvious. You could invoke/incorporate winpcap in your application. – John T Mar 02 '09 at 22:52
  • winpcap is for windows. my application runs on all platforms. It is also too much overhead. thanks for the suggestion though. – Corey Goldberg Mar 02 '09 at 22:57