I have written a short function in Python 3 to parse HTTP headers. Could anyone take a look and tell me what I could have done differently to improve it? It currently produces the required output, but I am not sure whether there are situations in which it would not produce the desired result.

This is what I have:

def _parse_headers(self, headers):
  lines = headers.split("\r\n")
  info = lines[0].split(" ")

  method = None
  path = None
  protocol = None
  headers = {}

  if len(info) > 0:
    method = info[0]
  if len(info) > 1:
    path = info[1]
  if len(info) > 2:
    protocol = info[2]

  for line in lines[1:]:
    if line:
      parts = line.split(":")
      key = None
      value = None
      if len(parts) > 0:
        key = parts[0]
      if len(parts) > 1:
        value = parts[1]
      if not key is None and not value is None:
        headers[key.strip().upper()] = value.strip()

  return {
    "method": method,
    "path": path,
    "protocol": protocol,
    "headers": headers
  }
TechnoCF
  • [This answer](http://stackoverflow.com/a/5955949/2629998) gives a nice way of parsing the headers using methods from the standard library. Use it instead of rolling your own code. –  Sep 11 '14 at 18:40
  • I can see some problems here. This does not properly handle headers that span multiple lines, and does not properly handle headers whose values contain a `:` character. There is also the issue of only recognizing `\r\n` line breaks: although `\n` line breaks are not strictly conformant, you should either explicitly accept or reject them. – Dietrich Epp Sep 11 '14 at 19:06
  • I agree with the other posters who recommend using an existing parsing library. But if you do want to "roll your own" you can eliminate that triple `if` construction with this hack: `method, path, protocol = (info + 3*[None])[:3]`. But it **is** a hack. :) – PM 2Ring Sep 11 '14 at 20:06
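
To make the points in these comments concrete, here is a minimal sketch of how the line handling might look if you do roll your own. It assumes you choose to accept bare `\n` as well as `\r\n`, and that you want continuation lines (lines starting with a space or tab) joined onto the previous header. The helper names `split_request_line` and `unfold_header_lines` are made up for illustration and are not part of the original code.

```python
import re


def split_request_line(line):
    """Return (method, path, protocol), padding missing fields with None."""
    info = line.split(" ")
    # The padding idiom from the comment above: always yields three values.
    method, path, protocol = (info + 3 * [None])[:3]
    return method, path, protocol


def unfold_header_lines(block):
    """Split a header block into logical header lines.

    Assumption: both "\r\n" and bare "\n" are accepted as line breaks, and a
    line starting with a space or tab continues the previous header.
    """
    logical = []
    for line in re.split(r"\r?\n", block):
        if not line:
            break  # a blank line ends the header section
        if line[0] in " \t" and logical:
            # Continuation line: append it to the previous logical line.
            logical[-1] += " " + line.strip()
        else:
            logical.append(line)
    return logical


print(split_request_line("GET /index.html"))
# ('GET', '/index.html', None)
print(unfold_header_lines("Host: example.com:80\nX-Long: first\r\n  second\r\n\r\nbody"))
# ['Host: example.com:80', 'X-Long: first second']
```

Whether to accept bare `\n` or reject it is a policy decision; the sketch accepts it only to show where that decision lives in the code.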

1 Answer

As noted by André in the comments, parsing HTTP is not to be taken lightly, unless as an exercise. In real programs you should generally stick to existing, mature implementations if possible.

Note that beyond the overall message structure, every header has its own peculiar internal structure, and you will often need to parse that too; Werkzeug can help there.

The obvious specific problems with your code are:

  • given a header `Host: www.example.com:80`, it will return `www.example.com` as its value;
  • given multiple headers with the same name, it will only return the value of the last one.
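
Both problems can be illustrated, and one possible fix sketched, in a few lines. This is only a sketch under assumptions: `parse_header_lines` is a hypothetical helper, it takes header lines that have already been split, it keeps the question's convention of upper-casing names, and it stores every value for a name in a list rather than overwriting.

```python
from collections import defaultdict


def parse_header_lines(lines):
    """Map each upper-cased header name to a list of all its values."""
    headers = defaultdict(list)
    for line in lines:
        # partition() splits on the first colon only, so colons inside the
        # value (such as a port number) are preserved.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon at all; a stricter parser might reject this
        headers[name.strip().upper()].append(value.strip())
    return dict(headers)


parsed = parse_header_lines([
    "Host: www.example.com:80",
    "Accept: text/html",
    "Accept: application/json",
])
print(parsed["HOST"])    # ['www.example.com:80']
print(parsed["ACCEPT"])  # ['text/html', 'application/json']
```

The cost of this design is that callers always get a list, even for headers that can only appear once.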
Vasiliy Faronov
  • I've fixed the first bullet point, but for the second one how would I tackle that? – TechnoCF Sep 11 '14 at 19:18
  • @TechnoCF Use data structures similar to [those for email headers](https://docs.python.org/3/library/email.message.html#email.message.Message), as that’s the origin of this message format. [See the standard `http.server`.](https://docs.python.org/3/library/http.server.html#http.server.BaseHTTPRequestHandler.MessageClass) – Vasiliy Faronov Sep 11 '14 at 20:17
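
As a rough illustration of the standard-library route suggested here, the sketch below reads the request line by hand and passes the rest to `http.client.parse_headers`, which returns an `http.client.HTTPMessage` (a subclass of `email.message.Message`). The wrapper `parse_request_head` and the sample request are assumptions made up for this example.

```python
import http.client
import io


def parse_request_head(raw):
    """Hypothetical helper: return (request_line, headers) from raw bytes."""
    fp = io.BytesIO(raw)
    # The request line is not a header, so read it before parsing headers.
    request_line = fp.readline().decode("iso-8859-1").rstrip("\r\n")
    # parse_headers() reads header lines up to the blank line.
    headers = http.client.parse_headers(fp)
    return request_line, headers


raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.example.com:80\r\n"
    b"Accept: text/html\r\n"
    b"Accept: application/json\r\n"
    b"\r\n"
)
request_line, headers = parse_request_head(raw)
print(request_line)               # GET /index.html HTTP/1.1
print(headers["host"])            # www.example.com:80 (lookup is case-insensitive)
print(headers.get_all("Accept"))  # ['text/html', 'application/json']
```

With this structure, `headers[name]` gives one value for the common case, while `get_all(name)` returns every occurrence of a repeated header.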