Trying to address this issue, I'm trying to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be three-fold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.util
library provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:
decode_rfc2231
only splits the value of such a parameter into its parts, like this:
>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']
decode_params
takes care of detecting RFC2231-encoded parameters. It collects parts which belong together, and also decodes the url-encoded string to a byte sequence. This byte sequence, however, is then encoded as latin1. And all values are enclosed in quotation marks. Furthermore, there is some special handling for the first argument, which still has to be a tuple of two elements, but those two get passed to the result without modification.
>>> email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"Täst.txt"'))]
collapse_rfc2231_value
can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.
>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
... email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]
So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?
What is the best reference on how to use these functions?
The best I found so far is the email.message.Message
implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue
after the decode_params
, and only get_filename
and get_boundary
collapse their values, all others return a tuple instead. I hope there is something more useful.