6

I want to send a POST request with a file attached, where some of the field names contain Unicode characters. However, they aren't received correctly by the server, as seen below:

>>> # normal, without unicode
>>> resp = requests.post('http://httpbin.org/post', data={'snowman': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'snowman': u'hello'}
>>>
>>> # with unicode, see that the name has become 'null'
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'null': u'hello'}
>>>
>>> # it works without the image
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}).json()['form']
>>> resp
{u'\u2603': u'hello'}

How do I get around this problem?

user1814016
  • For this kind of thing, i personally like to look and see what's really getting sent on the wire before i get too involved in trying things on one end or the other. Wireshark or tcpdump should provide some insight. – Rob Starling Dec 15 '13 at 05:59
  • The field value appears as `form-data;name*=utf-8''%5Cu2603` in Wireshark. I'm not sure how this helps. – user1814016 Dec 15 '13 at 06:15
  • how does that compare to the way it looks in the case that works? – Rob Starling Dec 15 '13 at 06:21
  • Just `%5Cu2603=hello`, since it's just x-www-form-urlencoded. `multipart/form-data` is the problem for some reason, and I don't know why. – user1814016 Dec 15 '13 at 06:29
  • That form-data doesn't look standard; i wonder if it's something "special" that python-requests does if the name isn't 7-bit ascii. What did the 1st case look like on the wire (name=snowman)? – Rob Starling Dec 15 '13 at 07:32
  • `form-data;name=\"snowman\"` – user1814016 Dec 15 '13 at 07:44
  • Requests is OK here. The syntax you see for the name field is according to [RFC 2231](http://tools.ietf.org/search/rfc2231), which specifies how one should send a header field when it uses a non-ASCII encoding. I think the real question is why httpbin's HTTP server (Gunicorn) can't parse it. – Lukasa Dec 15 '13 at 08:48
  • @Lukasa , i dug in a bunch on this, and "don't do it" seemed to be the practical answer :( – Rob Starling Dec 15 '13 at 09:06
  • It's also worth noting that `%5C` is just `\ `, so `%5cu2603` is just `\u2603`, which is definitely not the UTF-8 encoding of Unicode point 2603, but rather a python string literal, which would not be appropriate for the HTTP request. – Rob Starling Dec 15 '13 at 09:11
  • @RobStarling It's interesting that we're doing the encoding wrong, since we're just using [`email.utils.encode_rfc2231()`](http://docs.python.org/2/library/email.util.html#email.utils.encode_rfc2231). – Lukasa Dec 15 '13 at 09:17
  • maybe you're stringifying the python (and getting the 6-character low-ascii string `\u2603`) and then passing _that_ to `email.utils.encode_rfc2231()`? – Rob Starling Dec 15 '13 at 09:18
  • Perhaps I'm to blame here, since [I introduced that RFC 2231 format into urllib3](https://github.com/shazow/urllib3/pull/223) after reading [RFC 2388 Section 4.4](http://tools.ietf.org/html/rfc2388#section-4.4). Since [this comment](https://github.com/facebook/tornado/pull/869#issuecomment-23632083) pointed out that [the current HTML5 draft explicitly rejects RFC 2231](http://www.w3.org/html/wg/drafts/html/master/forms.html#multipart-form-data), I'm no longer certain about anything. We have conflicting standards, and what servers implement is yet another thing altogether. – MvG Jan 04 '14 at 22:43
  • Should we edit the question to include a note that it might be obsolete for versions of Requests that use urllib3 later than https://github.com/urllib3/urllib3/pull/1492/files? – Rob Starling Sep 04 '21 at 19:39

4 Answers

24

From the Wireshark comments, it looks like python-requests is doing it wrong, but there might not be a "right answer".

RFC 2388 says

Field names originally in non-ASCII character sets may be encoded within the value of the "name" parameter using the standard method described in RFC 2047.

RFC 2047, in turn, says

Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.

and goes on to describe "Q" and "B" encoding methods. Using the "Q" (quoted-printable) method, the name would be:

=?utf-8?q?=E2=98=83?=
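(As a throwaway illustration of my own, not something any library in this thread does, that encoded-word can be built by hand in Python 2:)

>>> octets = u'\u2603'.encode('utf-8')       # the snowman as UTF-8 bytes
>>> '=?utf-8?q?%s?=' % ''.join('=%02X' % ord(b) for b in octets)
'=?utf-8?q?=E2=98=83?='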

BUT, as RFC 6266 clearly states:

An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.

so we're not allowed to do that. (Kudos to @Lukasa for this catch!)

RFC 2388 also says

The original local file name may be supplied as well, either as a "filename" parameter either of the "content-disposition: form-data" header or, in the case of multiple files, in a "content-disposition: file" header of the subpart. The sending application MAY supply a file name; if the file name of the sender's operating system is not in US-ASCII, the file name might be approximated, or encoded using the method of RFC 2231.

And RFC 2231 describes a method that looks more like what you're seeing. In it,

Asterisks ("*") are reused to provide the indicator that language and character set information is present and encoding is being used. A single quote ("'") is used to delimit the character set and language information at the beginning of the parameter value. Percent signs ("%") are used as the encoding flag, which agrees with RFC 2047.

Specifically, an asterisk at the end of a parameter name acts as an indicator that character set and language information may appear at the beginning of the parameter value. A single quote is used to separate the character set, language, and actual value information in the parameter value string, and a percent sign is used to flag octets encoded in hexadecimal.

That is, if this method is employed (and supported on both ends), the name should be:

name*=utf-8''%E2%98%83
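For what it's worth, that is exactly what the standard library helper mentioned in the comments produces (a quick Python 2 check of my own; note the value has to be UTF-8-encoded to bytes first, which, as far as I can tell, is what urllib3 does on Python 2 before calling it):

>>> import email.utils
>>> email.utils.encode_rfc2231(u'\u2603'.encode('utf-8'), 'utf-8')
"utf-8''%E2%98%83"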

Fortunately, RFC 5987 adds an encoding based on RFC 2231 to HTTP headers! (Kudos to @bobince for this find.) It says you can (and probably should) include both an RFC 2231-style value and a plain value:

Header field specifications need to define whether multiple instances of parameters with identical parmname components are allowed, and how they should be processed. This specification suggests that a parameter using the extended syntax takes precedence. This would allow producers to use both formats without breaking recipients that do not understand the extended syntax yet.

Example:

foo: bar; title="EURO exchange rates"; title*=utf-8''%e2%82%ac%20exchange%20rates

In their example, however, they "dumb down" the plain value for "legacy clients". This isn't really an option for a form-field name, so it seems like the best approach might be to include both name= and name*= versions, where the plain value is (as @bobince describes it) "just sending the bytes, quoted, in the same encoding as the form", like:

Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83
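As a rough sketch of my own (Python 2 on a UTF-8 terminal; this is not what requests emits today), such a combined value could be assembled like this:

>>> import email.utils
>>> name = u'\u2603'.encode('utf-8')   # the snowman, as raw UTF-8 bytes
>>> print 'Content-Disposition: form-data; name="%s"; name*=%s' % (
...     name, email.utils.encode_rfc2231(name, 'utf-8'))
Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83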

Finally, see http://larry.masinter.net/1307multipart-form-data.pdf (also https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909#c8 ), wherein it is recommended to avoid the problem by sticking with ASCII form field names.

Rob Starling
  • This is an awesome answer, but doesn't seem to match my reading of the RFCs. You say it's "unclear" if RFC 2047 encoded-word is allowed in an attribute: it's not, as RFC 6266 clearly states: "An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'." RFC 2231 is the right thing here: I'm interested in why the standard library isn't doing it right. – Lukasa Dec 15 '13 at 09:19
  • Well, looks like I threw the blame around too soon: the standard library is doing the right thing. Back to urllib3. – Lukasa Dec 15 '13 at 09:20
  • RFC 2231 looks like the right thing, but i'm not finding permission to use it for `name` in RFC 2388 ... help me connect the dots? or maybe it's in a later HTTP RFC? – Rob Starling Dec 15 '13 at 09:21
  • I don't see any explicit permission either, though neither is it explicitly ruled out like RFC 2047. That's probably why it's poorly supported: gunicorn doesn't seem to like it at all. – Lukasa Dec 15 '13 at 09:25
  • edited to include your catch about RFC 6266 - thanks! – Rob Starling Dec 15 '13 at 09:27
  • What I'm confused about is how all major web browsers don't have this problem -- surely it can't be on the server's side? – user1814016 Dec 15 '13 at 09:40
  • @user1814016 do they? i don't think it's actually very common to use non-ascii *names* for form fields. As for *filenames* (which is much more common), it actually *is* clear that one should use RFC 2231 encoding for them. – Rob Starling Dec 15 '13 at 09:42
  • @user1814016 when i hit submit on that jsfiddle, it says (among other things) `"form": { "\u2603": "hello" },`, which is **not** right, imo. – Rob Starling Dec 15 '13 at 09:53
  • Would that not just be how httpbin displays unicode (non-multipart requests also have that)? Anyway, requestbin seems to just display a different character: http://requestb.in/oo0vndoo?inspect#ye09ke , very odd. – user1814016 Dec 15 '13 at 09:57
  • @user1814016 oh, yes. my mistake. the `"\u2603"` could totally be right. The different answer from requestbin, however, unfortunately supports the "it might not work well" conclusion :( – Rob Starling Dec 15 '13 at 10:00
  • Since requestbin is ephemeral, here's the "different character" mentioned: `Content-Disposition: form-data; name="â"` – Rob Starling Dec 15 '13 at 10:02
  • The browsers all seem to send `Content-Disposition: form-data; name="\342\230\203"` (through Wireshark; inspection consoles show the symbol), which is indeed the snowman, though requestbin interprets it differently for some reason. – user1814016 Dec 15 '13 at 10:13
  • `\342\230\203` is just raw UTF-8 (`E2 98 83`) with no indication of encoding. This also explains requestbin's interpretation a bit, as `E2` is extended ascii for `â` (Dunno why it dropped the `˜` and `ƒ`) – Rob Starling Dec 15 '13 at 10:29
  • The HTML's charset is set to UTF-8 in both the HTTP headers and in the HTML itself. Since the `<form>` element does not have an `accept-charset` attribute, browsers must submit the form field names/values using the charset of the HTML. That behavior is dictated by W3C's HTML specs, which is why you are seeing raw UTF-8 octets without an accompanying charset specifier. – Remy Lebeau Dec 17 '13 at 02:31
4

The field value appears as `form-data;name*=utf-8''%5Cu2603` in Wireshark

Two things here.

  1. It doesn't for me; I get `name*=utf-8''%E2%98%83`. `%5Cu2603` is what I would expect from accidentally typing a `\u` escape in a non-Unicode string, i.e. writing `'\u2603'` (which in a Python 2 byte string is just the six literal characters `\u2603`) rather than `'☃'` as above.

  2. As discussed at some length, this is the RFC 2231 form of extended Unicode headers:

The RFC 2231 format was previously invalid in HTTP (HTTP is not a mail standard in the RFC 822 family). It has now been brought to HTTP by RFC 5987, but because that is pretty recent, almost nothing on the server side supports it.

Definitely urllib3 should not be relying on it; it should be doing what browsers do and just sending the bytes, quoted, in the same encoding as the form. If it must use the 2231 form, it should be in combination with the plain value, as in RFC 5987 section 4.2. E.g. in `urllib3.fields.format_header_param`, instead of:

value = email.utils.encode_rfc2231(value, 'utf-8')

You could say:

value = '%s="%s"; %s*=%s' % (
    name, value, name,
    email.utils.encode_rfc2231(value, 'utf-8')
)

However, including the 2231 form at all may still confuse some older servers.

bobince
  • thanks for the RFC 5987 find! I edited my answer a bit to include it. – Rob Starling Dec 15 '13 at 19:47
  • The use of RFC 2231 for file names is inside a nested MIME header, but inside the HTTP *body*. So RFC 5987 does not apply to that. The fact about servers being confused is certainly true, though, and not only old ones. :-( – MvG Jan 04 '14 at 22:45
1

I guess I'm the one to blame for the fact that urllib3, and therefore Requests, produces the format it does. When I wrote that code I mostly had file names in mind, and RFC 2388 section 4.4 suggests the use of that RFC 2231 format there.

With respect to field names, RFC 2388 section 3 refers to RFC 2047, which in turn forbids the use of encoded words in Content-Disposition fields. So it seems to me and others that these two standards contradict one another. But perhaps RFC 2388 should take precedence, in which case using RFC 2047 encoded words would be more correct.

Recently I've been made aware that the current draft of the HTML5 standard has a section on the encoding of multipart/form-data. It contradicts several other standards, but nevertheless it might be the future. With regard to field names (not file names) it describes an encoding which turns characters into decimal XML entities, e.g. `&#9731;` for your snowman. However, that encoding should only be applied if the encoding established for the submission does not contain the character in question, which should not be the case in your setup.
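For illustration only (my own sketch of that fallback rule, not code from the HTML5 draft or from urllib3), the escaping would amount to something like:

def escape_field_name(name, charset):
    # Replace only the characters the form's charset cannot encode with
    # decimal character references; everything else passes through.
    out = []
    for ch in name:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(u'&#%d;' % ord(ch))
    return u''.join(out)

escape_field_name(u'\u2603', 'ascii')   # -> u'&#9731;'
escape_field_name(u'\u2603', 'utf-8')   # -> u'\u2603' (left alone)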

I've filed an issue for urllib3 to discuss the consequences of this, and probably address them in the implementation.

MvG
0

Rob Starling's answer is very insightful and shows that using non-ASCII characters in field names is a bad idea compatibility-wise (all those RFCs!), but I managed to get python-requests to follow the most common (from what I can see) way of handling it.

Inside site-packages/requests/packages/urllib3/fields.py, delete this (line ~50):

value = email.utils.encode_rfc2231(value, 'utf-8')

And change the line right underneath it to this:

value = '%s="%s"' % (name, value.decode('utf-8'))

This makes servers (that I've tested) pick up the field and process it correctly.
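If you'd rather not edit the installed file, an untested Python 2 alternative along the lines of bobince's comment below is to monkey-patch the function at runtime (the module path is the one requests vendored at the time; newer versions lay this out differently):

import requests.packages.urllib3.fields as fields

def format_header_param(name, value):
    # Skip the RFC 2231 form entirely and send the quoted UTF-8 bytes,
    # the way browsers do.
    if isinstance(value, unicode):      # Python 2: encode unicode to UTF-8 bytes
        value = value.encode('utf-8')
    return '%s="%s"' % (name, value)

fields.format_header_param = format_header_param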

user1814016
  • `.decode('utf-8')` doesn't seem right, as on Python 2 we should be aiming to return a byte string and on 3 `value` might not be a byte string (so `decode` would fail). (I'm not sure urllib's attempts to mix 2 and 3 string types really works here in general!) It might be simpler to replace the whole function with `return '%s="%s"' % (name, value)`... – bobince Dec 15 '13 at 14:11