Python regular expression for HTTP Request header

Question

I have a question about Python regular expressions, which I am still new to. I am parsing HTTP request messages with a regex. An HTTP GET message has this format:

GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: 10.2.0.12
Connection: Keep-Alive

I want to extract the method, URI, User-Agent, and Host fields of the message. My regex for this job is:

re.compile(r'^({0})\s+(\S+)\s+[^\n]*$\n.*^User-Agent:\s*(\S+)[^\n]*$\n.*^Host:\s*(\S+)[^\n]*$\n'.format('|'.join(methods)), re.MULTILINE|re.DOTALL)

But when the message arrives with the headers in a different order, such as

GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive

I cannot match it, because the positions of Host and User-Agent have changed. So I need a generic regex that matches all of these fields regardless of the order in which the headers appear in the message.

[This should help you](http://stackoverflow.com/questions/4685217/parse-raw-http-headers) — tuxuday, May 31 '12 at 11:53
@tuxuday +1 Searching the freakin' web is the most powerful skill in a developer's toolbox. — Adam Matan, May 31 '12 at 12:16
I like the method that @tuxuday suggests, like this: `m = re.findall(r"(?P<name>.*?): (?P<value>.*?)\r\n", req)`. But with this method I cannot parse "GET" and the HTTP version. Is it better to add them at the beginning? — barp, May 31 '12 at 12:31
Do the job the right way. Use `cgi.parse_header()` to get values from the string, or use some tools like WebOb. — kimjxie, May 31 '12 at 15:49

Adam Matan · Answer 1 · 2012-05-31T13:37:01.613

Readability Counts (The Zen of Python)

Use findall() for each subexpression you want to find. This way your regex will be short, readable, and independent of the location of the subexpression.

Define a simple, readable regex:

>>> user=re.compile("User-Agent: (.*?)\n")

Test it with two HTTP requests whose headers are in different orders:

>>> s1='''GET / HTTP/1.0
    Host: 10.2.0.12
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Connection: Keep-Alive'''
>>> s2='''GET / HTTP/1.0
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Host: 10.2.0.12
    Connection: Keep-Alive'''
>>> user.findall(s1)
['Wget/1.12 (linux-gnu)']
>>> user.findall(s2)
['Wget/1.12 (linux-gnu)']
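
The same per-field idea can be extended to the request line and to CRLF line endings. What follows is only a minimal sketch, not part of the original answer: the names `REQUEST_LINE`, `header_pattern`, and `parse_request` are made up for illustration, and it assumes each string holds a single, reasonably well-formed request.

```python
import re

# Request line: method, URI, and HTTP version (e.g. "GET / HTTP/1.0").
REQUEST_LINE = re.compile(
    r"^(\S+)\s+(\S+)\s+HTTP/(\d+\.\d+)[ \t]*\r?$", re.MULTILINE
)


def header_pattern(name):
    """Build a regex that finds one header on any line (hypothetical helper)."""
    # With re.MULTILINE, ^ and $ match at line boundaries, so the header's
    # position in the message does not matter. "\r?" accepts both "\n" and
    # "\r\n" endings. Leading whitespace is tolerated only so that indented
    # sample strings like s1 and s2 above still match.
    return re.compile(
        r"^[ \t]*" + re.escape(name) + r":[ \t]*(.*?)[ \t]*\r?$",
        re.MULTILINE | re.IGNORECASE,
    )


def parse_request(raw):
    """Return method, URI, version, User-Agent, and Host, or None if no request line."""
    first = REQUEST_LINE.search(raw)
    if first is None:
        return None
    result = {
        "method": first.group(1),
        "uri": first.group(2),
        "version": first.group(3),
    }
    for name in ("User-Agent", "Host"):
        match = header_pattern(name).search(raw)
        # A missing header is reported as None rather than raising.
        result[name] = match.group(1) if match else None
    return result
```

Calling `parse_request(s1)` and `parse_request(s2)` should give the same dictionary for both strings, since each field is searched for independently.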

The HTTP spec requires \r\n line endings, so I'd suggest using re.compile("User-Agent: (.*?)\r\n") instead — Sergey Kandaurov, Feb 09 '22 at 16:12

Maria Zverina · Accepted Answer · 2012-05-31T12:50:41.220


How about parsing all the headers into a dictionary, like so?

headers = """GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive"""


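lines = headers.splitlines()
first_line = lines.pop(0)
# The request line has three parts: method, URI, and HTTP version.
verb, url, version = first_line.split()
d = {'verb': verb, 'url': url, 'version': version}
for line in lines:
    # Split on the first ': ' only, so values that contain ': ' stay intact.
    parts = line.split(': ', 1)
    if len(parts) < 2:
        continue
    field, value = parts
    d[field] = value

print(d)

print(d['User-Agent'])
print(d['url'])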
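
As a variation on the dictionary approach, the standard library can do the header splitting. This is a hedged sketch rather than the answer's own method: it relies on HTTP headers sharing the `Name: value` syntax of email headers, so `email.parser.Parser` can read them, and the function name `split_request` is hypothetical.

```python
from email.parser import Parser


def split_request(raw):
    """Split a raw request into (method, uri, version, headers)."""
    # Everything before the first newline is the request line;
    # str.split() also discards a trailing "\r" if the message uses CRLF.
    request_line, _, header_block = raw.partition("\n")
    method, uri, version = request_line.split()
    # headersonly=True stops at the blank line that ends the header block.
    # The returned Message object supports case-insensitive lookups.
    headers = Parser().parsestr(header_block, headersonly=True)
    return method, uri, version, headers


raw = (
    "GET / HTTP/1.0\r\n"
    "Host: 10.2.0.12\r\n"
    "User-Agent: Wget/1.12 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
method, uri, version, headers = split_request(raw)
print(method, uri, version)
print(headers["user-agent"])
print(headers["Host"])
```

The main difference from a plain dictionary is that lookups ignore case, so `headers["host"]` and `headers["Host"]` return the same value, and a missing header yields `None` instead of raising `KeyError`.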

edited May 31 '12 at 12:50

answered May 31 '12 at 11:57

Maria Zverina


Remember to strip your values and ignore lines that don't contain `:` - `d=dict([[i.strip() for i in l.split(':')] for l in s1.splitlines() if ":" in l])` – Adam Matan May 31 '12 at 12:04
And +1 - liked your approach. As I wrote, Readability counts. – Adam Matan May 31 '12 at 12:14
I need to parse the "GET method" also. without ':' – barp May 31 '12 at 12:28
Updated to include the first line :) – Maria Zverina May 31 '12 at 12:50
