Is there a standard that browser http-posting follows? If not can the server detect the encoding in any way?

interstar
  • I think your question may be answered here: http://stackoverflow.com/questions/708915/detecting-the-character-encoding-of-an-http-post-request – John Lockwood Jul 10 '13 at 19:32

1 Answer

Is there a standard that browser http-posting follows?

There is now, since HTML5 has codified the behaviour, but it's not straightforward.

The encoding used by the browser to encode text when submitting a form is usually the same encoding that it used to view the page containing the form. So if you have included a Content-Type: ...;charset=... HTTP header or a <meta> tag declaring the charset, then that encoding will be used unless the user deliberately changes the encoding of the page from the browser settings.

Users won't generally change this setting unless your page has been served with the wrong charset and is unreadable. (Even then, the setting is getting more obscure in modern browsers.)

If you don't set the encoding of the page containing the form then you could get anything; often it'll be the non-UTF encoding associated with the user's region, but all bets are off.

If you include the attribute accept-charset="..." in your <form> element then you are supposed to always get the form submitted in that encoding, regardless of the encoding of the form page (whether set by the page or chosen by the user). Unfortunately, accept-charset is broken in IE: the given charset is only used when the form contains characters outside of the range that can be encoded in the page's encoding. This makes the submitted encoding inconsistent depending on the entered content.

There is a workaround to this if the charset you want is UTF-8 (and usually it will be): include a field containing a character that does not exist in any non-UTF encoding. One possible choice is the Replacement Character:

<form accept-charset="utf-8">
    <input type="hidden" name="enforce-charset" value="&#xFFFD;"/>
    <!-- the rest of the form's fields go here -->
</form>
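On the receiving side, the same hidden field can double as a sanity check. The following is a minimal Python sketch, not part of the original answer: it assumes the raw application/x-www-form-urlencoded body is available as ASCII bytes, and the function name `submitted_as_utf8` is hypothetical. It looks for the percent-encoded UTF-8 bytes of U+FFFD in the `enforce-charset` field.

```python
from urllib.parse import unquote_plus, unquote_to_bytes

# UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER: EF BF BD
EXPECTED = "\ufffd".encode("utf-8")


def submitted_as_utf8(body: bytes, field: str = "enforce-charset") -> bool:
    """Return True if the marker field arrived as the UTF-8 bytes of U+FFFD.

    Assumption: `body` is a percent-encoded form body containing only ASCII.
    """
    for pair in body.decode("ascii").split("&"):
        name, _, value = pair.partition("=")
        if unquote_plus(name) == field:
            # Undo form encoding ('+' means space) but keep the raw bytes,
            # so no charset has to be guessed before comparing.
            return unquote_to_bytes(value.replace("+", " ")) == EXPECTED
    return False  # marker field missing: nothing can be concluded
```

If the check fails, the form was probably submitted in some other encoding (or the field was stripped), and the remaining fields should be treated with suspicion rather than decoded blindly as UTF-8.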

Finally, if a form contains characters that are outside the chosen encoding for submitting the form, then those characters are sent encoded as HTML character references. This is really confusing because that kind of encoding is never normally used in forms, and it's an unrecoverable mangling because given &#233; you can never tell if the user really typed &#233; or é.
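To make the ambiguity concrete, here is a small Python sketch that imitates this behaviour. It is an approximation under stated assumptions: Python's `xmlcharrefreplace` error handler stands in for the browser's fallback to character references, `cp1252` stands in for the form's encoding, and `browser_style_encode` is a hypothetical helper name.

```python
from urllib.parse import quote_plus


def browser_style_encode(value: str, charset: str) -> str:
    """Approximate how a field value is encoded for submission.

    Characters the charset cannot represent become decimal character
    references, then the resulting bytes are percent-encoded.
    """
    raw = value.encode(charset, errors="xmlcharrefreplace")
    return quote_plus(raw)


# A user who typed the Greek letter omega, which cp1252 cannot represent...
typed_omega = browser_style_encode("\u03a9", "cp1252")
# ...and a user who literally typed the seven characters "&#937;"
typed_reference = browser_style_encode("&#937;", "cp1252")

# Both submissions are byte-for-byte identical on the wire.
assert typed_omega == typed_reference
```

The server receives the same bytes in both cases, so no amount of post-processing can tell which one the user meant.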

If not can the server detect the encoding in any way?

This should have been doable at least for POST forms by having browsers pass Content-Type: ...;charset= headers with form submissions. Unfortunately no actual browsers do this. A few servers support it, but when the guys at Mozilla tried to implement it in Firefox it broke loads of other servers, so reality is it ain't ever going to happen.

There is a newer IE extension that has recently been included in HTML5, which is to add to your form:

<input type="hidden" name="_charset_"/>

(Both the type and name are important.) Browsers that support this hack will submit a form parameter called _charset_ set to the encoding they are sending, e.g. utf-8 or windows-1252. If your server recognises that encoding, it can pick the parameter up and decode the rest of the submission with it.
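A server-side handler for this might look roughly like the Python sketch below. It is only an illustration: the function name `decode_form`, the UTF-8 fallback, and the two-pass approach are assumptions of this example, not something the browsers or HTML5 prescribe. The first pass decodes with latin-1 (which never fails) purely to read the ASCII value of `_charset_`; the second pass decodes everything with the charset it names.

```python
import codecs
from urllib.parse import parse_qsl


def decode_form(body: bytes, fallback: str = "utf-8") -> dict:
    """Decode a urlencoded form body using its _charset_ field, if present.

    Assumption: `body` is percent-encoded and therefore pure ASCII.
    """
    text = body.decode("ascii")

    # Pass 1: latin-1 maps every byte to a character, so this cannot fail.
    probe = dict(parse_qsl(text, keep_blank_values=True, encoding="latin-1"))
    charset = probe.get("_charset_") or fallback

    # Fall back if the browser named an encoding Python does not know.
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = fallback

    # Pass 2: decode all fields with the charset the browser reported.
    return dict(
        parse_qsl(text, keep_blank_values=True, encoding=charset, errors="replace")
    )
```

For example, `decode_form(b"_charset_=windows-1252&name=caf%E9")` and `decode_form(b"_charset_=utf-8&name=caf%C3%A9")` both yield a `name` of `café`. Note that `dict()` keeps only the last value of a repeated field, which is a simplification.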

Generally the recipe for handling form submissions consistently is: serve your own forms in pages marked as containing UTF-8; if you care enough about the user sabotaging the encoding, include accept-charset and the enforcement hack.
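Put together, the recipe could be sketched like this in Python. The page content, the `/submit` action, the `comment` field, and the `build_response` helper are all made up for illustration; the point is only that the charset is declared in the header and the markup, and that the form carries both `accept-charset` and the enforcement field.

```python
# Hypothetical form page: every name and URL here is illustrative.
FORM_PAGE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Example form</title></head>
<body>
<form method="post" action="/submit" accept-charset="utf-8">
<input type="hidden" name="enforce-charset" value="&#xFFFD;"/>
<input type="text" name="comment"/>
<input type="submit"/>
</form>
</body>
</html>
"""


def build_response():
    """Return (headers, body) for serving the form page as UTF-8."""
    body = FORM_PAGE.encode("utf-8")
    headers = [
        # Declare the charset in the HTTP header as well as the <meta> tag.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

The header list is in the shape a WSGI `start_response` call expects, but any framework's way of setting the Content-Type would do the same job.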

If you have to accept form submissions from elsewhere and you can't persuade the sender to include either accept-charset plus the enforcement hack, or the _charset_ hack, then all you have is guesswork.

bobince