2

I have a question about Python regular expressions, which I have little experience with. I am working with HTTP request messages and parsing them with a regex. HTTP GET messages look like this:

GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: 10.2.0.12
Connection: Keep-Alive

I want to parse the method, URI, User-Agent, and Host fields of the message. My regex for this job is:

re.compile(r'^({0})\s+(\S+)\s+[^\n]*$\n.*^User-Agent:\s*(\S+)[^\n]*$\n.*^Host:\s*(\S+)[^\n]*$\n'.format('|'.join(methods)), re.MULTILINE|re.DOTALL)

But when the message arrives like this:

GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive

I cannot match them, because the positions of Host and User-Agent have changed. So I need a generic regex that matches all of them, regardless of the order in which these fields appear in the message.

rlandster
barp
  • [This should help you](http://stackoverflow.com/questions/4685217/parse-raw-http-headers) – tuxuday May 31 '12 at 11:53
  • @tuxuday +1 Searching the freakin' web is the most powerful skill in a developer's toolbox. – Adam Matan May 31 '12 at 12:16
  • I like the method that @tuxuday suggests, like this: `m = re.findall(r"(?P<name>.*?): (?P<value>.*?)\r\n", req)`. But with this method I cannot parse "GET" and the HTTP version. Is it better to add them at the beginning? – barp May 31 '12 at 12:31
  • Do the job the right way. Use `cgi.parse_header()` to get values from the string, or use some tools like WebOb. – kimjxie May 31 '12 at 15:49

2 Answers

4

Readability Counts (The Zen of Python)

Use findall() for each subexpression you want to find. This way your regex will be short, readable, and independent of the location of the subexpression.

Define a simple, readable regex:

>>> import re
>>> user = re.compile("User-Agent: (.*?)\n")

Test it with two requests whose HTTP headers appear in a different order:

>>> s1='''GET / HTTP/1.0
    Host: 10.2.0.12
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Connection: Keep-Alive'''
>>> s2='''GET / HTTP/1.0
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Host: 10.2.0.12
    Connection: Keep-Alive'''
>>> user.findall(s1)
['Wget/1.12 (linux-gnu)']
>>> user.findall(s2)
['Wget/1.12 (linux-gnu)']
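
The same idea extends to the other fields the question asks for. Below is a minimal sketch, assuming one short pattern per field and `re.MULTILINE` so that each pattern anchors at the start of a line. The names `METHODS`, `REQUEST_LINE`, `PATTERNS`, and `parse_request` are made up for illustration, and the list of methods is an assumption, not something taken from the question:

```python
import re

# Assumed list of methods; adjust to whatever `methods` holds in your code.
METHODS = ("GET", "POST", "HEAD", "PUT", "DELETE", "OPTIONS")

# Request line: method, URI, and HTTP version.
REQUEST_LINE = re.compile(
    r"^\s*({0})\s+(\S+)\s+(HTTP/\d\.\d)\s*$".format("|".join(METHODS)),
    re.MULTILINE,
)

# One short pattern per header, so the order of the headers does not matter.
FLAGS = re.MULTILINE | re.IGNORECASE
PATTERNS = {
    "user_agent": re.compile(r"^\s*User-Agent:[ \t]*(.*?)\s*$", FLAGS),
    "host": re.compile(r"^\s*Host:[ \t]*(.*?)\s*$", FLAGS),
}


def parse_request(text):
    """Return a dict with the method, URI, version, and selected headers."""
    result = {}
    m = REQUEST_LINE.search(text)
    if m:
        result["method"], result["uri"], result["version"] = m.groups()
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            result[name] = m.group(1)
    return result
```

Because each pattern is searched independently, `parse_request` should return the same dictionary for both orderings shown above, and it also tolerates `\r\n` line endings.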
Adam Matan
2

Why not parse all of the headers into a dictionary, like so?

headers = """GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive"""

# The first line is the request line; the remaining lines are headers.
lines = headers.splitlines()
verb, url, version = lines.pop(0).split()
d = {'verb': verb, 'url': url, 'version': version}
for line in lines:
    # Split on the first colon only, so values that contain ':' stay intact.
    field, sep, value = line.partition(':')
    if not sep:
        continue  # skip blank or malformed lines
    d[field.strip()] = value.strip()

print(d)

print(d['User-Agent'])
print(d['url'])
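
HTTP header field names are case-insensitive, so a client may just as well send `host:` or `HOST:`. Here is a sketch of the same dictionary approach wrapped in a function that lower-cases the names and stops at the blank line that ends the header block. The function name `parse_request_headers` is hypothetical, and the sample request is invented for the demonstration:

```python
def parse_request_headers(raw):
    """Parse a raw HTTP request head into (request_line_dict, headers_dict)."""
    lines = raw.splitlines()
    verb, url, version = lines[0].split()
    request = {'verb': verb, 'url': url, 'version': version}
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(':')
        if sep:
            # Lower-case the name so lookups do not depend on the sender's casing.
            headers[name.strip().lower()] = value.strip()
    return request, headers


# Invented sample request with mixed-case header names and \r\n line endings.
request, headers = parse_request_headers(
    "GET / HTTP/1.0\r\nhost: 10.2.0.12\r\nUSER-AGENT: Wget/1.12 (linux-gnu)\r\n\r\n"
)
print(request['verb'], request['url'], headers['host'], headers['user-agent'])
```

Keeping the request line in its own dictionary avoids a clash if a request ever carries a header literally named `url` or `version`.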
Maria Zverina