tl;dr: When the browser/user-agent submits a form, the data is encoded as UTF-8 (in my tests), but the HTTP request carries no information about that encoding. How does the user-agent decide to use UTF-8? And how should the application code (the code which receives the request) decide which character set to use to decode the incoming data?
Over the past few days I have been digging around the internet to find out how data is encoded when sent from the browser to the web server. It turns out to be non-trivial, as there seem to be no clear standards on this matter.
RFC2616 (HTTP) is largely based on ISO-8859-1 and US-ASCII. But extensions exist to allow for other character sets (like RFC2047). edit: RFC2616 has been obsoleted by RFC7231 which has removed the note about ISO-8859-1 (see Appendix B)
The Request Body
Essentially, when a user agent sends a request which contains a body, the problem seems to be well defined: use a Content-Type header including a charset parameter. For example:
Content-Type: text/plain; charset=utf-8
This is easy to do with JavaScript. But today, I ran into the problem that you cannot specify the charset when using an HTML form element. While searching, I came across this SO question, but in my opinion, the answer is incorrect. It suggests using the accept-charset attribute. But according to the reference, this header is used to tell the server which charsets are acceptable to the client/user-agent, not the other way around.
The related FORM attribute enctype specifies the content-type of the submitted document. But it only allows three values, and if they are not used as-is, the user-agent (Chrome in this case) defaults to application/x-www-form-urlencoded. You cannot specify a character set, which is good in my opinion, as it is the job of the UA to encode it for you.
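To make the problem concrete, here is a minimal Python sketch of what a urlencoded body looks like when the same value is encoded with two different character sets. The field name and value are made up for illustration, and Python's `urllib.parse.urlencode` only stands in for whatever the UA does internally.

```python
from urllib.parse import urlencode

# Hypothetical form field; "é" is U+00E9.
fields = {"name": "café"}

# Encoded as UTF-8, "é" becomes two bytes (C3 A9).
as_utf8 = urlencode(fields, encoding="utf-8")

# Encoded as ISO-8859-1, the same character is a single byte (E9).
as_latin1 = urlencode(fields, encoding="iso-8859-1")

print(as_utf8)    # name=caf%C3%A9
print(as_latin1)  # name=caf%E9
```

Both bodies are plain ASCII on the wire, and nothing in them says which character set produced the percent-escapes.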
But as a result, the request which arrives on the server is completely devoid of any information about the character set used. So how should the application code decide which encoding to use?
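One possible approach on the receiving side, sketched in Python: honour a charset parameter if the Content-Type header happens to carry one, otherwise try strict UTF-8, and only then fall back to ISO-8859-1. The function names are hypothetical, and the fallback order is my assumption, a heuristic rather than anything a standard prescribes.

```python
from email.message import Message
from typing import Optional
from urllib.parse import parse_qs


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_form_body(body: bytes, content_type: Optional[str], default: str = "utf-8"):
    """Decode an application/x-www-form-urlencoded body into a dict of lists."""
    charset = charset_from_content_type(content_type) or default
    # A conforming urlencoded body is pure ASCII; the real bytes hide in %XX escapes.
    text = body.decode("ascii")
    try:
        # errors="strict" makes invalid byte sequences raise instead of being replaced.
        return parse_qs(text, keep_blank_values=True, encoding=charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # Heuristic last resort (assumption): ISO-8859-1 maps every byte to a character,
        # so it never fails, but it may produce mojibake.
        return parse_qs(text, keep_blank_values=True, encoding="iso-8859-1")


print(decode_form_body(b"name=caf%C3%A9", "application/x-www-form-urlencoded"))
print(decode_form_body(b"name=caf%E9", "application/x-www-form-urlencoded"))
```

Trying UTF-8 first works reasonably well because text in other encodings is rarely valid UTF-8 by accident, but it remains a guess.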
Another question is: How does the user-agent decide which character set to use when submitting a form? In all my tests they submitted it as UTF-8. But where does this come from? Sniffing the network traffic gave me no indication of the source. The originating web page does contain a meta tag saying that the page is in UTF-8, though. Is that it?
I assume that the UA is using the same character set as the page it just received from the server. But what if the page it requests from application A (in UTF-8) contains a form with a POST action to application B? Assuming that is at all possible (the same-origin policy only applies to XHR, right?)... In that scenario, the UA has no "a priori" information on the encoding application B expects. How does it decide what encoding to pick?
HTTP "preamble" and Headers
Just noting this down as a reference
URIs are well-defined after 2005 (see RFC3986), and should use UTF-8. Before that, no standard was defined and it is a bit of guesswork.
Header field parameter values are well defined in RFC5987.
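As a rough illustration of the two notes above, here is a Python sketch that percent-encodes a URI path as UTF-8 and builds an RFC5987-style extended parameter value. The helper name `rfc5987_ext_value`, the path, and the file name are made up for the example, and the language tag between the two single quotes is simply left empty.

```python
from urllib.parse import quote


def rfc5987_ext_value(value: str, charset: str = "UTF-8") -> str:
    """Build charset'language'percent-encoded-bytes with an empty language tag."""
    # safe="" escapes everything except unreserved characters,
    # which are all allowed unescaped in an RFC5987 value.
    return f"{charset}''{quote(value, safe='', encoding=charset)}"


# URI path: non-ASCII characters are encoded as UTF-8, then percent-escaped.
path = "/files/" + quote("naïve.txt")

# Header parameter, as used by Content-Disposition (RFC6266).
header = "attachment; filename*=" + rfc5987_ext_value("naïve.txt")

print(path)    # /files/na%C3%AFve.txt
print(header)  # attachment; filename*=UTF-8''na%C3%AFve.txt
```

Unlike a form body, the header parameter names its own charset, so the receiver does not have to guess.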
References:
- Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters - RFC5987
- Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP) Appendix C - RFC6266
- HTML Form Element (enctype)
- Uniform Resource Identifier (URI): Generic Syntax - RFC3986