How do web servers know the charset used in forms posted to them?

Question

When a web server gets a POST of a form, parsing it into param-value(s) pairs is quite straightforward. However, if the values contain non-English chars that have been encoded by the browser, it must know the charset used in order to decode them.

I've examined the requests sent by two posts. One was done from a page using UTF-8, and one from a page using Windows-1255. The same text was encoded differently. AFAIK, the Content-Type header could carry a charset parameter after application/x-www-form-urlencoded, but it didn't (using Firefox).
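The difference between the two posts can be reproduced outside the browser. The following is a minimal sketch, assuming Java 10 or later and a JDK that ships the windows-1255 charset; the class name, the helper method and the sample word are illustrative and not taken from the captured requests.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormEncodingDemo {
    // Percent-encodes a form value using the given charset, roughly as a
    // browser does for a form on a page served in that charset.
    static String encode(String value, Charset pageCharset) {
        return URLEncoder.encode(value, pageCharset);
    }

    public static void main(String[] args) {
        // Hypothetical sample value: the Hebrew word "shalom", written with
        // Unicode escapes so the source file encoding does not matter.
        String text = "\u05E9\u05DC\u05D5\u05DD";

        // Two bytes per letter in UTF-8.
        System.out.println(encode(text, StandardCharsets.UTF_8));
        // prints %D7%A9%D7%9C%D7%95%D7%9D

        // One byte per letter in Windows-1255.
        System.out.println(encode(text, Charset.forName("windows-1255")));
        // prints %F9%EC%E5%ED
    }
}
```

Nothing in either encoded string says which charset produced it, which is why the receiving side needs that information from somewhere else.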

In a servlet, when you use request.getParameter(), you're supposed to get the decoded value. How does the servlet container do that? Does it always bet on UTF-8, use some heuristics, or is there some deterministic way I'm missing?

Possible duplicate of http://stackoverflow.com/questions/708915/detecting-the-character-encoding-of-an-http-post-request — Sripathi Krishnan, Jun 21 '11 at 05:11
@Sripathi Krishnan - They are similar, I agree... Would still like to know, though, how commonly adopted frameworks are dealing with this lack of information. Since most of the internet works, imitating their behavior is probably the most effective way. — Eran, Jun 21 '11 at 05:24

score 1 · Accepted Answer · edited May 23 '17 at 12:04

From the Servlet 3.0 Spec, section 3.10 Request Data Encoding (emphasis mine):

Currently, many browsers do not send a char encoding qualifier with the Content-Type header, leaving open the determination of the character encoding for reading HTTP requests. The default encoding of a request the container uses to create the request reader and parse POST data must be “ISO-8859-1” if none has been specified by the client request. However, in order to indicate to the developer, in this case, the failure of the client to send a character encoding, the container returns null from the getCharacterEncoding method.

If the client hasn’t set character encoding and the request data is encoded with a different encoding than the default as described above, breakage can occur. To remedy this situation, a new method setCharacterEncoding(String enc) has been added to the ServletRequest interface. Developers can override the character encoding supplied by the container by calling this method. It must be called prior to parsing any post data or reading any input from the request. Calling this method once data has been read will not affect the encoding.
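The "breakage" the spec mentions can be sketched in plain Java, without a servlet container. This is only an illustration, assuming Java 10 or later and a client that actually sent UTF-8; the class and method names are made up for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DefaultDecodingDemo {
    // Decodes a percent-encoded value with ISO-8859-1, the default the spec
    // prescribes when the request declares no charset.
    static String decodeWithContainerDefault(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    // Recovers the intended text after the fact, assuming the original bytes
    // were UTF-8. ISO-8859-1 maps every byte to exactly one char, so the
    // round trip back to bytes loses nothing.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u05E9\u05DC\u05D5\u05DD"; // hypothetical sample value
        String onTheWire = URLEncoder.encode(original, StandardCharsets.UTF_8);

        String garbled = decodeWithContainerDefault(onTheWire);
        System.out.println(garbled.equals(original));                     // false
        System.out.println(garbled.length());                             // 8, one char per byte
        System.out.println(reinterpretAsUtf8(garbled).equals(original));  // true
    }
}
```

Re-decoding after the fact works only because ISO-8859-1 is lossless per byte. Calling setCharacterEncoding before any parameter is read avoids the garbled intermediate value altogether.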

In practice, I find that the charset set on the response that serves the form influences the charset the browser uses for the subsequent POST. To be extra sure, you can write a Servlet Filter that calls setCharacterEncoding on every request object before it is used.
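The decision such a filter has to make can be sketched without the servlet API. The code below is a container-independent illustration, not the container's actual implementation: it honours a charset declared in the Content-Type header and otherwise falls back to a charset the application chooses. The class name, the method and the UTF-8 fallback are assumptions for the example. In a real filter the same choice would translate into calling setCharacterEncoding when getCharacterEncoding returns null, before any parameter is read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class RequestCharsetResolver {
    // Returns the charset declared in a Content-Type header value, or the
    // fallback when the header is missing, declares no charset, or names a
    // charset this JVM does not know.
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; any parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (!param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                continue;
            }
            String name = param.substring("charset=".length()).trim();
            // Strip optional surrounding double quotes.
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            if (name.isEmpty()) {
                continue;
            }
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // No charset parameter, as in the Firefox requests from the question.
        System.out.println(resolve("application/x-www-form-urlencoded",
                StandardCharsets.UTF_8));
        // An explicitly declared charset takes precedence over the fallback.
        System.out.println(resolve("application/x-www-form-urlencoded; charset=ISO-8859-1",
                StandardCharsets.UTF_8));
    }
}
```

The fallback is only correct if every page that hosts a form is served in that same charset, since the browser encodes the form data using the encoding of the containing page.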

You may also find this thread useful - Detecting the character encoding of an HTTP POST request

score -1 · Answer 2 · answered Jun 20 '11 at 14:02

The appropriate header for specifying charsets is Accept-Charset.

For example, the latest Chrome for Linux sends the following on each request:

Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3

Section 14.2 from http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html states:

The Accept-Charset request-header field can be used to indicate what character sets are acceptable for the response. This field allows clients capable of understanding more comprehensive or special-purpose character sets to signal that capability to a server which is capable of representing documents in those character sets.

(...)

If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed.

So if you receive such a header from a client, the value with the highest q can be the encoding you're receiving from it.

That's the encoding accepted by the client; it doesn't necessarily indicate the encoding used for the POST parameters. — Julian Reschke, Jun 20 '11 at 14:52
Right, that's why I said "CAN be the encoding". If the browser favors one encoding for response, why should it post with another ? — Niloct, Jun 20 '11 at 15:01
...because it assumes that the server expects the encoding of the page that contained the form. — Julian Reschke, Jun 20 '11 at 15:59
The Accept-Charset header is set by the browser, but the encoding used for the form's content is determined by the posting page's encoding. Furthermore, on a Windows-1255 encoded page I tested, the Accept-Charset didn't include Windows-1255, just the regular "ISO-8859-1,utf-8;q=0.7,*;q=0.7". — Eran, Jun 20 '11 at 18:17
