
I'm creating a RESTful service where the client may be posting XML, JSON, or unstructured text. Conceivably the client could post Chinese characters, etc. There is a question that is nearly the same, Detecting the character encoding of an HTTP POST request, but it is four years old and I wanted to see if any "best practices" had coalesced.

EDIT: This is not for information posted from a form (web page) but for client applications, so the Content-Type of the POST request will be things like text/xml, text/plain, and maybe application/json.

Aerik
  • As an interesting tangent, I'm testing my service with a simple web page and some Ajax calls. For fun, I tried setting the charset in the Content-Type header of my Ajax requests. In Chrome, if I set it to something other than UTF-8, Chrome *changes* it to UTF-8 - I can see it in the request headers on my server! If I set it in IE, it is sent along to my server. Another interesting note: in C# / ASP.NET, HttpRequest.ContentEncoding is set to this value (the request's charset, as specified in the Content-Type header). – Aerik Apr 18 '13 at 21:56

2 Answers


For XML and JSON the best practice is to always encode in UTF-8. If you really must not use UTF-8, XML has mechanisms for declaring other character sets: first the charset parameter on the MIME type, then the encoding attribute of the XML declaration.
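For illustration, here is a minimal sketch of those two declaration points; the header value and payload are invented for this example, and per RFC 3023 the header's charset parameter is authoritative when both are present:

```python
# Hypothetical request parts showing the two places an XML payload
# can declare a non-UTF-8 encoding (values are illustrative only).
headers = {
    # 1. charset parameter on the MIME type
    "Content-Type": "text/xml; charset=ISO-8859-1",
}
# 2. encoding attribute of the XML declaration; the body really is
#    ISO-8859-1 bytes (0xE9 is "é" in that encoding)
body = b'<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\xe9</note>'
```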

Steve

The character set of a www form POST is always ASCII due to the embedded percent-encoding, so a charset declaration for application/x-www-form-urlencoded is unnecessary. In fact, specifying a charset for this MIME type is invalid.

So to get from the raw octets:

0x6b65793d76254333254134254333254241254333254142

to the string:

key=v%C3%A4%C3%BA%C3%AB

virtually any encoding will work the same, because of ASCII compatibility.
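To see this concretely, here is a quick Python sketch using those exact octets:

```python
raw = bytes.fromhex("6b65793d76254333254134254333254241254333254142")

# Any ASCII-compatible codec yields the identical string, because a
# percent-encoded form body contains only ASCII bytes.
assert raw.decode("ascii") == raw.decode("utf-8") == raw.decode("iso-8859-1")
print(raw.decode("ascii"))  # key=v%C3%A4%C3%BA%C3%AB
```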

You may notice the data is still encoded. The charset parameter of a request's Content-Type only applies to the raw octets sent ("converting a sequence of octets into a sequence of characters", as the specs put it), not to the mechanism used in turning key=v%C3%A4%C3%BA%C3%AB into key=väúë, which actually involves converting characters into other characters.
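That second, character-to-character step can be made concrete with the standard library's unquote; the choice of UTF-8 here is exactly the assumption discussed below:

```python
from urllib.parse import unquote

# Percent-decoding is a separate, character-level transformation that the
# Content-Type charset parameter says nothing about. The encoding argument
# tells unquote how to interpret the bytes behind the %XX escapes.
print(unquote("key=v%C3%A4%C3%BA%C3%AB", encoding="utf-8"))  # key=väúë
```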

The application/x-www-form-urlencoded scheme "specification" in HTML 4 is pretty useless, but HTML 5 actually tries. The ultimate default encoding behind the percent-encoding is UTF-8, with the actual encoding name transferred in the _charset_ magic parameter when available.
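A sketch of how a server might honor that hint, using Python's standard library (the two-pass approach is an illustration, not something the spec mandates):

```python
from urllib.parse import parse_qsl

def parse_form(body: bytes) -> dict[str, str]:
    text = body.decode("ascii")  # a urlencoded body is pure ASCII
    # First pass: ISO-8859-1 decoding never fails, so we can safely peek
    # at the _charset_ field an HTML5 form may have included.
    hint = dict(parse_qsl(text, encoding="iso-8859-1")).get("_charset_", "utf-8")
    # Second pass: decode the percent-escapes with the declared encoding.
    return dict(parse_qsl(text, encoding=hint))
```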

So yeah, there still isn't a good, widely used, formal way to declare the character encoding for the embedded percent-encoding (and a charset in the Content-Type is just invalid, wrong and misunderstood). In practice I would just use UTF-8, and since it's a very strict scheme, fall back to ISO-8859-1 when that fails, because you can always go back from ISO-8859-1.
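That fallback is straightforward to implement, since strict UTF-8 decoding raises on invalid byte sequences while ISO-8859-1 accepts every possible byte; a minimal sketch:

```python
def decode_value(raw: bytes) -> str:
    try:
        # UTF-8 is strict: byte sequences that aren't valid UTF-8 raise here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails
        # and the original bytes can always be recovered afterwards.
        return raw.decode("iso-8859-1")
```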


For JSON, using any encoding other than UTF-8/16/32 is invalid, with UTF-8 being assumed almost everywhere. For XML, you can read the Content-Type header, fall back to the encoding attribute of the XML declaration, and ultimately fall back to UTF-8, declaring the document invalid if it doesn't decode.
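A simplified sketch of that detection chain (real parsers also sniff byte-order marks and allow single-quoted attributes; the regexes here are only illustrative):

```python
import re

def xml_encoding(content_type: str | None, body: bytes) -> str:
    # 1. charset parameter of the Content-Type header
    if content_type:
        m = re.search(r'charset="?([\w.:-]+)"?', content_type, re.I)
        if m:
            return m.group(1)
    # 2. encoding attribute of the XML declaration (readable as ASCII)
    m = re.match(rb'<\?xml[^>]*\bencoding="([\w.:-]+)"', body)
    if m:
        return m.group(1).decode("ascii")
    # 3. ultimate fallback; reject the document if it fails to decode
    return "utf-8"
```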

Esailija
  • Thank you, that's some interesting stuff, but the clients posting to my service will be sending data with mime types of text/xml, text/plain, and possibly application/json. – Aerik Apr 18 '13 at 23:09
  • @Aerik Oh. The post you linked is completely unrelated then; I wanted to correct the misconceptions posted there. I have a small snippet about those types, if it helps. `application/x-www-form-urlencoded` doesn't have to come from a browser, btw. – Esailija Apr 18 '13 at 23:16