
I am playing around with the idea of adding an integrated HTTP interface to a daemon I am building in Python. I like this approach because it keeps the whole daemon's code portable, rather than splitting it into a separate web portion and CLI portion.

Everything works great, but I am wondering about best practices for parsing the actual request I receive in the do_GET method.

Here is my prototype do_GET method

def do_GET(self):
    body = b"OK"  # bytes for wfile.write; avoids shadowing the built-in str
    print(self.request)
    self.send_response(200)
    self.send_header("Content-type", "text/html")
    self.send_header("Content-length", str(len(body)))
    self.end_headers()
    self.wfile.write(body)

The request attribute contains the following string when a request is received:

127.0.0.1 - - [15/Jan/2014 10:21:23] "GET /" 200 -

Is there a standard library I can use to parse this string? If I wrote a custom parser, I believe I would first tokenize the string using - as a delimiter and then handle the third element with regular expressions such as \[([^\]]+)\] for the request date and "([^"]+)" for the request path.

I am worried about writing a custom parser because of all the edge cases I may run into, so I am asking whether Python has any standard methods for parsing this.

Thanks for your time.

DevZer0
  • If you find yourself unable to find a library capable of parsing these strings, consider using [`pyparsing`](http://pyparsing.wikispaces.com/) rather than using regexes, for improved robustness. – senshin Jan 15 '14 at 03:35
  • @senshin OK, thank you for the tip; I will check that out. – DevZer0 Jan 15 '14 at 03:38

3 Answers


If you can find a solid library that parses these strings, that's obviously your best bet.

That failing, in case you want to try a solution with pyparsing, this might help you get started:

from pyparsing import Combine, Literal, Word
from pyparsing import alphanums, alphas, nums

data = '127.0.0.1 - - [15/Jan/2014 10:21:23] "GET /" 200 -'

ip_octet = Word(nums, min=1, max=3)
ip_sep = Literal('.')
ip = Combine(ip_octet + ip_sep
             + ip_octet + ip_sep
             + ip_octet + ip_sep
             + ip_octet)

day = Word(nums, min=1, max=2)
month = Word(alphas, exact=3)
year = Word(nums, exact=4)
date_sep = Literal('/')
date = Combine(day + date_sep
               + month + date_sep
               + year)
hms = Word(nums, min=1, max=2)
time_sep = Literal(':')
time = Combine(hms + time_sep
               + hms + time_sep
               + hms)
datetime = Literal('[').suppress() + date + time + Literal(']').suppress()

method = Word(alphas) # GET, etc
# path characters per RFC 1738 / <http://stackoverflow.com/a/1856809/1535629>
path = Word(alphanums + "$-_.+!*'(),/%")
req_enclosure = Literal('"').suppress()
req = req_enclosure + method + path + req_enclosure

code = Word(nums, exact=3) # HTTP status code

nodash = Literal('-').suppress()
parser = ip + nodash + nodash + datetime + req + code + nodash

result = parser.parseString(data)
print(result)

Result:

['127.0.0.1', '15/Jan/2014', '10:21:23', 'GET', '/', '200']

It's a lot more verbose than using re, for sure, but also more readable and maintainable, in my opinion.


Also, if you want, you can use regexes in pyparsing, as follows:

import re
from pyparsing import Regex

data = '127.0.0.1'

ip_re = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
ip = Regex(ip_re)

result = ip.parseString(data)
print(result)

Result:

['127.0.0.1']

This leaves you with the option of mixing and matching regexes and pyparsing features in whatever way you find most convenient.
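For comparison, here is a minimal sketch of the same parse using only the standard re module. The pattern assumes the exact field layout of the sample line above (host, two dashes, bracketed date and time, quoted request, status code, size); the names LOG_LINE and parse_log_line are made up for illustration, and real log lines may need a more forgiving pattern.

```python
import re

# Assumed layout: host, two dashes, [date time], "METHOD path", status, size.
LOG_LINE = re.compile(
    r'(?P<host>\S+) - - '
    r'\[(?P<date>[^\]\s]+) (?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^" ]+)(?: [^"]*)?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


data = '127.0.0.1 - - [15/Jan/2014 10:21:23] "GET /" 200 -'
fields = parse_log_line(data)
print(fields)
```

Named groups keep the result self-describing, and returning None on a mismatch lets the caller decide how to handle lines that do not fit the assumed layout.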

senshin

http://deron.meranda.us/python/httpheader/ is one library that can help you parse HTTP headers.

praveen
  • I don't believe this is what I am looking for. The request string I encountered does not contain any header information, and the library you suggest seems to be meant for header parsing when serving Python scripts via a standard HTTP server. – DevZer0 Jan 15 '14 at 03:40

OK, after further investigation I found that CGIHTTPRequestHandler has an attribute called path (inherited from BaseHTTPRequestHandler). Changing the do_GET method as follows gives me the desired result:

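def do_GET(self):
    body = b"OK"  # bytes for wfile.write; avoids shadowing the built-in str
    print(self.path)  # the request target, e.g. /send/message
    self.send_response(200)
    self.send_header("Content-type", "text/html")
    self.send_header("Content-length", str(len(body)))
    self.end_headers()
    self.wfile.write(body)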

outputs

/send/message

when called with GET /send/message
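If the request also carries a query string, self.path includes it verbatim, so it still needs to be split. A minimal sketch of one way to do that with the standard library, assuming Python 3 (in Python 2 the same functions live in the urlparse module); the helper name split_request_path is hypothetical, and inside do_GET it would be called with self.path:

```python
from urllib.parse import parse_qs, urlsplit


def split_request_path(raw_path):
    """Split a request target such as '/send/message?to=bob' into parts.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    parts = urlsplit(raw_path)
    # parse_qs maps each key to a list because a key may repeat in a query.
    return parts.path, parse_qs(parts.query)


if __name__ == "__main__":
    print(split_request_path("/send/message?to=bob&text=hi"))
```

The handler also exposes self.command (the HTTP method) and self.headers, so the logged request line itself should not need to be parsed.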

DevZer0