9

I am looking for a native way to parse an HTTP request in Python 3.

This question shows a way to do it in Python 2, but it relies on modules that are now deprecated. I am looking for an equivalent in Python 3.

I would mainly like to figure out which resource is requested and parse the headers from a simple request, e.g.:

GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

Can someone show me a basic way to parse this request?

Community
Startec
  • 1
    Your first sentence shows that you know you should just use a library (e.g. `urllib3`, `requests`). Then you say you're trying to do it in Python 3, and don't know how. Why don't you just use `requests`? – Jonathon Reinhart Aug 22 '16 at 23:54
  • @JonathonReinhart I am working in an environment that does not allow the use of third party libraries. – Startec Aug 23 '16 at 00:34
  • 1
    urllib is not third party – OneCricketeer Aug 23 '16 at 00:58
  • And it would appear this class in the standard library does what you want. https://docs.python.org/3/library/http.server.html#http.server.BaseHTTPRequestHandler.MessageClass – OneCricketeer Aug 23 '16 at 01:02
  • 1
    @cricket_007 he does not mention `urllib`. He mentions `urllib3` which is third party. – Startec Aug 23 '16 at 01:48
  • Try kiss-headers, a dedicated library to parse them the right way. https://pypi.org/project/kiss-headers/ – Ousret Apr 13 '20 at 14:10
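
For reference, the standard-library class mentioned in the comments above can be pointed at a raw request held in memory. This is only a minimal sketch: the `HTTPRequest` wrapper class and its `error_code` / `error_message` attributes are names made up for illustration, and it assumes the request is available as bytes. It relies on `BaseHTTPRequestHandler.parse_request()`, which reads the request line from `raw_requestline` and the headers from `rfile`.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """Hypothetical helper: parse a raw request without opening a socket."""

    def __init__(self, request_bytes):
        # Deliberately skip the parent constructor, which expects a socket.
        self.rfile = BytesIO(request_bytes)
        self.raw_requestline = self.rfile.readline()
        self.error_code = None
        self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record parse errors instead of writing a response to a client.
        self.error_code = code
        self.error_message = message


raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)
request = HTTPRequest(raw)

print(request.command)          # request method
print(request.path)             # requested resource
print(request.request_version)  # protocol version
print(request.headers["Host"])  # header lookup is case-insensitive
```

If the request line is malformed, `error_code` should hold the status code that the handler would otherwise have sent back, so it is worth checking before trusting `command` and `path`.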

3 Answers

7

You could use the email.message.Message class from the email module in the standard library.

By modifying the answer from the question you linked, below is a Python3 example of parsing HTTP headers.

Suppose you wanted to create a dictionary containing all of your header fields:

import email
import pprint

request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'

# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)

# construct a message from the request string. note: the return is already a dict-like object.
message = email.message_from_string(headers)

# construct a dictionary containing the headers
headers = dict(message.items())

# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)

If you ran this at a Python prompt, the result would look like:

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
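
The example above discards the request line, which is where the requested resource lives. A minimal sketch of keeping it, assuming the request line is well formed (three space-separated parts); the function name `split_request` is hypothetical and not part of the email module:

```python
import email


def split_request(request_string):
    """Return (method, target, version, headers) for a raw request string."""
    # The request line ends at the first CRLF; the rest is the header block.
    request_line, _, header_block = request_string.partition("\r\n")
    # Assumption: a well-formed request line such as "GET /index.html HTTP/1.1".
    method, target, version = request_line.split(" ", 2)
    # The returned Message object is dict-like and case-insensitive on lookup.
    headers = email.message_from_string(header_block)
    return method, target, version, headers


method, target, version, headers = split_request(
    "GET /index.html HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n"
)
print(method, target, version)
print(headers["host"])
```

A malformed request line raises ValueError at the unpacking step, so real input probably wants a try/except around the call.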
newUserHa
Corey Goldberg
  • This is great - and yes, sorry my formatting of the original request was bad. However, where do I get the resource here? (i.e. the actual resource being requested). Since we `pop` it, how do I know what was actually requested? – Startec Aug 23 '16 at 01:53
  • @Startec it would be in the first line, along with the request method and protocol version. – Corey Goldberg Aug 23 '16 at 01:56
  • So I would have to do some string splitting on the first line? – Startec Aug 23 '16 at 02:00
  • yes, you could probably just split the first line on whitespace to grab the resource name. – Corey Goldberg Aug 23 '16 at 02:20
  • Thanks for your excellent answer. Could you describe what the `StringIO` call is doing here? – Startec Aug 23 '16 at 06:16
  • 1
    @Startec `StringIO` creates an in-memory file object to feed `email.message_from_file` (which expects a text stream). You can also [parse messages directly from bytes, strings or binary streams](https://docs.python.org/3/library/email.parser.html#email.message_from_bytes). – Nuno André Jan 25 '20 at 05:20
2

Each header field is terminated by a carriage return followed by a line feed (\r\n), and the field name is separated from its value by a colon. So, assuming you already have the request as a string named resp, it should be as easy as:

fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    key,value = field.split(':', 1)#split each line by http field name and value
    output[key] = value

Update 4/13

Using the example HTTP request from the linked post:

resp = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'


fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    if not field:
        continue
    key,value = field.split(':', 1)
    output[key] = value    
print(output)

The extra check for an empty field is needed because the trailing \r\n leaves an empty string at the end of the split. Output:

{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-Dictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}
liviaerxin
Liam Kelly
  • 1
    That code won't work. Patch it by adding `maxsplit=1` to `split()`. You may also want to split on `\n` instead of `\r\n`, which is more generic; then don't forget to strip the trailing `\r`, if any. – Ousret Apr 13 '20 at 14:27
  • 1
    You may want to consider a dedicated library like kiss-headers to handle them properly. – Ousret Apr 13 '20 at 14:30
  • @Ousret - updated post to show that code works even on the example request in the post. I did need to have quick error check if field was empty, but for example code it holds up. As for using libraries, that is a good default choice. – Liam Kelly Apr 13 '20 at 15:41
  • 1
    Check out this header : `User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0` It will fail with this. ;) – Ousret Apr 13 '20 at 16:08
  • You may have to ignore the second line as well (i.e. `Host` field and value) in case the port number is explicitly included in the url. I.E. use `fields = fields[2:]` or `key,value = field.split(':')` will throw error. – Matthew Thomas Nov 05 '22 at 00:03
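
The points raised in the comments above (splitting only on the first colon, tolerating bare `\n` line endings, values that contain colons such as `localhost:8080`) can be folded into one small helper. This is a rough sketch rather than a full HTTP parser: the name `parse_header_lines` is hypothetical, header names are lower-cased as a design choice, and a repeated header simply overwrites the earlier value.

```python
def parse_header_lines(raw_headers):
    """Return a dict mapping lower-cased header names to stripped values."""
    headers = {}
    # splitlines() accepts both "\r\n" and "\n" line endings.
    for line in raw_headers.splitlines():
        if not line.strip():
            break  # a blank line marks the end of the header block
        # partition() splits on the first colon only, so colons in the
        # value (ports, "rv:50.0", timestamps) are left untouched.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "name: value" pairs
        headers[name.strip().lower()] = value.strip()
    return headers


example = (
    "Host: localhost:8080\n"
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) "
    "Gecko/20100101 Firefox/50.0\r\n"
    "\r\n"
)
parsed = parse_header_lines(example)
print(parsed["host"])
print(parsed["user-agent"])
```

The request line has to be removed before calling this, for example with `resp.split("\r\n", 1)`; otherwise it is skipped or misread as a header.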
0

Here are some Python packages aimed at proper HTTP protocol parsing:

buherator