How do browsers decide which character set to use when sending requests? And how should we deal with it?

Question

tl;dr: When the browser/user-agent submits a form, it gets submitted as UTF-8 (in my tests), but does not include that information in the HTTP request. How does the user-agent decide to use UTF-8? And how should the application code (the code which receives the request) decide which character set to use to decode the incoming data?

Over the past few days I have been digging around the internet to find out how data is encoded when sent from the browser to the web-server. It turns out the matter is non-trivial as there are no clear standards on this matter.

~~RFC2616 (HTTP) is largely based on ISO-8859-1 and US-ASCII. But extensions exist to allow for other character sets (like RFC2047).~~ edit: RFC2616 has been obsoleted by RFC7231 which has removed the note about ISO-8859-1 (see Appendix B)

The Request Body

Essentially, when a user agent sends a request which contains a body, the problem seems to be well defined: Use a Content-Type header including a charset parameter. For example:

Content-Type: text/plain; charset=utf-8

This is easy to do with JavaScript. But today I ran into the problem that you cannot specify the charset when using an HTML form element. While searching, I came across this SO question, but in my opinion the answer is incorrect: it suggests using the accept-charset attribute. From the reference, however, this header is used to tell the server which charsets are acceptable to the client/user-agent, not the other way around.

The related FORM attribute enctype specifies the content-type of the submitted document. But it only allows three values, and if they are not used as-is, the user-agent (Chrome in this case) defaults to application/x-www-form-urlencoded. You cannot specify a character set, which is good in my opinion, as it is the job of the UA to encode it for you.

But as a result, the request which arrives on the server is completely devoid of any information about the character set used. So how should the application code decide which encoding to use?
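One possible server-side policy, sketched in Python: honour a charset parameter on the Content-Type header when the client sent one, and otherwise fall back to a configured default. The function name `decode_form_body` and the UTF-8 fallback are assumptions made for illustration, not something mandated by HTTP.

```python
from email.message import Message
from urllib.parse import parse_qs


def decode_form_body(body: bytes, content_type: str, fallback: str = "utf-8") -> dict:
    """Decode an application/x-www-form-urlencoded body (hypothetical helper)."""
    # Reuse the stdlib header parser to pull out the charset parameter, if any.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback  # assumption: UTF-8 fallback
    # The percent-encoded body itself is plain ASCII; the charset only
    # matters when the %XX escapes are turned back into characters.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="strict")


# No charset in the header: the fallback is used.
utf8_form = decode_form_body(b"name=caf%C3%A9", "application/x-www-form-urlencoded")

# Explicit charset in the header: it takes precedence over the fallback.
latin1_form = decode_form_body(
    b"name=caf%E9", "application/x-www-form-urlencoded; charset=iso-8859-1"
)
```

Using `errors="strict"` makes a wrong guess fail loudly instead of silently producing replacement characters, which seems preferable while the correct encoding is still uncertain.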

Another question is: How does the user-agent decide which character set to use when submitting a form? In all my tests they submitted it as UTF-8. But where does this come from? Sniffing the network traffic gave me no indication where this might come from. Although, the originating web-page contains a meta-tag saying that the page is in UTF-8. Is that it?

I assume that the UA is using the same character set as the page it just received from the server. But what if the page it requests from application A (in UTF-8) contains a form with a POST action to application B? Assuming that is at all possible (the same-origin policy only applies to XHRIO, right?)... In that scenario, the UA has no a priori information on the encoding. How does it decide which encoding to pick?

HTTP "preamble" and Headers

Just noting this down as a reference

URIs are well-defined after 2005 (see RFC3986), and should use UTF-8. Before that, no standard was defined and it is a bit of guesswork.
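As a small illustration of what UTF-8 in a URI looks like in practice, here is a sketch using Python's standard `urllib.parse`. The sample string and the ISO-8859-1 variant are assumptions chosen only to show the difference between the two encodings.

```python
from urllib.parse import quote, unquote

path_segment = "café"

# quote() percent-encodes the UTF-8 bytes of the string by default.
encoded = quote(path_segment)

# unquote() also assumes UTF-8 unless told otherwise.
decoded = unquote(encoded)

# An older client might have sent ISO-8859-1 bytes instead; decoding
# that correctly requires knowing (or guessing) the encoding.
legacy = unquote("caf%E9", encoding="iso-8859-1")
```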

Header values are well defined in RFC5987.
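For reference, RFC5987 carries the charset inside the parameter value itself, in the form charset'language'percent-encoded-text. A minimal decoding sketch follows; `decode_ext_value` is a hypothetical helper, it does no validation, and the sample value is one of the examples given in the RFC.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> str:
    """Decode an RFC5987 ext-value such as UTF-8''%e2%82%ac%20rates."""
    # The value has three parts separated by single quotes:
    # charset, optional language tag, percent-encoded text.
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset, errors="strict")


title = decode_ext_value("UTF-8''%c2%a3%20and%20%e2%82%ac%20rates")
```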

References:

Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters - RFC5987
Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP) Appendix C - RFC6266
HTML Form Element (enctype)
Uniform Resource Identifier (URI): Generic Syntax - RFC3986

Please stop worrying about RFC2616; it has been obsoleted a few months ago. In this particular case it's not an aspect of HTTP anyway -- as indicated in the answer, it's a property of the HTML form submission process. — Julian Reschke, Nov 04 '14 at 11:57
Indeed. I have been going over the "obsoleted by" references, but somehow missed RFC7231, which clarifies quite a bit in [section 5.3.3](http://tools.ietf.org/html/rfc7231#section-5.3.3). After a day of swimming through RFCs, I must have zoned out at some point :( — exhuma, Nov 04 '14 at 14:25

score 2 · Answer 1 · answered Nov 04 '14 at 10:15

The procedure user agents follow to select an encoding for HTML5 form submission is described in section 4.10.22.5, Selecting a form submission encoding.

It defaults to UTF-8 if no (valid) accept-charset attribute is present on the form.
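A simplified sketch of that selection step, in Python. This is my reading of the section rather than a faithful implementation: as I understand it, the UTF-8 fallback applies when accept-charset is present but names no usable encoding, while a form without the attribute uses the encoding of the document that contains it. The function name is hypothetical, and Python's codec registry stands in for the browser's table of encoding labels, which is an assumption (the two do not match exactly).

```python
import codecs
from typing import Optional


def pick_form_encoding(accept_charset: Optional[str], document_encoding: str) -> str:
    """Simplified model of form submission encoding selection (hypothetical)."""
    if accept_charset is None:
        # No attribute on the form: use the containing document's encoding.
        return codecs.lookup(document_encoding).name
    # The attribute holds a whitespace-separated list of encoding labels;
    # take the first one that is recognised.
    for label in accept_charset.split():
        try:
            return codecs.lookup(label).name
        except LookupError:
            continue  # unknown label, try the next one
    # Attribute present but nothing usable: fall back to UTF-8.
    return "utf-8"
```

Under this reading, a page served as UTF-8 with a plain form submits UTF-8, which matches the behaviour observed in the question.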

For HTML 4 it is:

The default value for [the accept-charset] attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.
