If the input data is being included in a header field parameter (for example, the `filename` parameter of the `Content-Disposition` header), it can be encoded with `email.utils.encode_rfc2231` (as constrained by these specifications, which define a variation of the RFC 2231 encoding).
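As a minimal sketch of the parameter case (assuming Python 3; the helper name `content_disposition_attachment` is made up for illustration), the encoded value can be placed in the extended `filename*` parameter. `encode_rfc2231` percent-encodes everything outside a small safe set, so `\r` and `\n` cannot survive in the result:

```python
import email.utils


def content_disposition_attachment(filename):
    """Build a Content-Disposition header *value* for an untrusted filename.

    Hypothetical helper, not part of any library.
    """
    # Returns "UTF-8''<percent-encoded value>" (charset, empty language, value).
    encoded = email.utils.encode_rfc2231(filename, 'UTF-8')
    # The trailing "*" marks the parameter as using the extended encoding.
    return "attachment; filename*=" + encoded


print(content_disposition_attachment("a b.txt"))
print(content_disposition_attachment("x\r\ny: z"))
```

Whether a given user agent honours `filename*` is a separate question, so this is worth testing against the clients you care about.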
If it is not being included in a header field parameter, then it seems that this method cannot be used. In such a situation, the safest bet would likely be to not include the input at all, as Julian Reschke wrote. However, if you insist on including the input, you may want to try one of the following methods (which may be insecure: HTTP is not a MIME-compliant protocol, so unless the `MIME-Version` header is used (and possibly even if it is used?), these methods may not work correctly for HTTP).
One way...
to do this is to use the `email.header` module, although it is not totally foolproof (**edit**: it is *not* foolproof when used by itself; it accepts `\r\n\r\n`, which terminates the headers and starts the body! Therefore `\r` and `\n` also need to be handled, unless they are followed by non-`\r`/`\n` whitespace, like tabs or spaces).

This module is designed specifically for RFC 822 headers (**edit**: but seemingly not for HTTP headers, since the `email` package used to be several separate modules (example)!), so it would seem to be the tool for the job. Its `Header` class is meant for encoding header *values*, not the full `Header-Name: value`, and so is a candidate for this job (where we want to validate or escape the value *only*).
(Hint: many of the tools in the `email` module are also handy when working with other MIME-format (edit: and possibly also MIME-like) stuff; so too is stuff in the `cgi` module, `cgi.FieldStorage` in particular for HTTP-form parsing. Note that `cgi` was deprecated in Python 3.11 and removed in Python 3.13.)
However, `email.header` will only raise an error if the input seems malicious (that is, seems to contain another, embedded header); it will not, it seems, handle invalid input by escaping it (please correct this in the comments if it is not so). (The `charset` parameter should escape the header fragment, returning valid output; however, it may not have such good compatibility with user agents (email, HTTP, etc.); see here. **Edit**: [many HTTP user agents support](https://stackoverflow.com/a/1361646/541412) the RFC 5987 encoding, though not necessarily the encoding produced by the `charset` parameter of the `email.header.Header` class, which seems to use some MIME-specific encodings besides the RFC 2231 encoding.)
Example:
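    import email.header
    import re

    def check_string_for_rfc822_header(s):
        # On Python 3, str(Header(s)) skips the embedded-header check;
        # encode() runs it and raises email.errors.HeaderParseError.
        wip_header_component = email.header.Header(s).encode()
        # Reject a line break followed by a non-whitespace character
        # or by another line break (i.e. one that is not a folded line).
        if re.search(r'(\r?\n[\S\n\r]|\r[\S\r])', wip_header_component):
            raise ValueError('unfolded line break in header value')
        return wip_header_component

    # testing...
    >>> check_string_for_rfc822_header("aaa")
    'aaa'
    >>> check_string_for_rfc822_header("a\r\nb")
    <error>  (ValueError, from the regex check)
    >>> check_string_for_rfc822_header("a\r\nb: c")
    <error>  (email.errors.HeaderParseError, from Header.encode())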
Another way...
to do this, it seems, would be to simply remove `\r` and `\n` characters (each separately, however; do not just remove occurrences of the full string `\r\n`, since this would still leave them unescaped when they occur separately, and many (most?) HTTP utils will accept each of them separately!). Similarly, we can escape the header by replacing `\r\n`, `\r`, and `\n` with themselves followed by whitespace (which is the way to escape a header; see the standard).

However, this method does not take into account the details of the standards (for example, RFC 822 headers must be ASCII), which may be exploitable on their own.
Example:
    def remove_linebreakers(s):
        # Remove \n and \r independently, not just the pair "\r\n".
        return s.replace("\n", "").replace("\r", "")

    # or...
    import re

    def remove_linebreakers(s):
        return re.sub(r'[\n\r]', '', s)

    # testing...
    >>> remove_linebreakers("aaa")
    'aaa'
    >>> remove_linebreakers("a\r\nb")
    'ab'
    >>> remove_linebreakers("a\r\nb: c")
    'ab: c'
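The escaping variant (keeping each line break but adding whitespace after it, so that the next line reads as a continuation of the same header) could be sketched as follows. The function name `fold_linebreakers` is made up for this example, and the caveat about HTTP applies in full: RFC 7230 deprecates this kind of line folding for HTTP, so a receiver may reject or mangle the result.

```python
import re


def fold_linebreakers(s):
    """Append a space after every line break so it becomes a folded line.

    Hypothetical helper; "\r\n" is listed first in the pattern so that it
    is treated as one break rather than as "\r" followed by "\n".
    """
    return re.sub(r'\r\n|\r|\n', lambda m: m.group(0) + ' ', s)


print(repr(fold_linebreakers("a\r\nb: c")))   # 'a\r\n b: c'
print(repr(fold_linebreakers("a\rb\nc")))     # 'a\r b\n c'
```

Removing the characters outright is the more conservative choice; folding only helps if every consumer of the header actually implements unfolding.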
In summary...
the first way seems better, but only for validation (not for escaping), unless the input is a parameter value, in which case escape it using `email.utils.encode_rfc2231`.
Example:
    import email.header
    import email.utils
    import re
    # if we are not working with a header param value, the following...
    # ...raises email.errors.HeaderParseError if input is poisonous when in a header
    wip_header_component = email.header.Header('<input>').encode()
    # ...and we additionally reject line breaks that are not folded
    if re.search(r'(\r?\n[\S\n\r]|\r[\S\r])', wip_header_component):
        raise ValueError('unfolded line break in header value')
    header_component = wip_header_component
    # ...or if we *are* working with a header param value...
    param_value = email.utils.encode_rfc2231('<input>', 'UTF-8')