59

I'm building a web service and have a node that accepts a POST to create a new resource. The resource expects one of two content-types - an XML format I'll be defining, or form-encoded variables.

The idea is that consuming applications can POST XML directly and benefit from better validation etc., but there's also an HTML interface that will POST the form-encoded data. Obviously the XML format has a charset declaration, but I can't see how to detect the form's charset just from looking at the POST.

A typical post to the form from Firefox looks like this:

POST /path HTTP/1.1
Host: www.myhostname.com
User-Agent: Mozilla/5.0 [...etc...]
Accept: text/html,application/xhtml+xml, [...etc...]
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

field1=value1&field2=value2&field3=value3

Which doesn't seem to contain any useful indication of the character set.

From what I can see, the application/x-www-form-urlencoded type is entirely defined in HTML, which just lays out the %-encoding rules, but doesn't say anything about what charset the data should be in.

Basically, is there any way of telling the character set if I don't know which character set the HTML form was originally served in? Otherwise I'll have to guess the character set from the bytes that are present, and that's always a bit iffy from what I can tell.

bignose
Ciaran McNulty
  • 1
    There are many subtleties here and behavior will vary by browser and operating system. One convention used by IE is that if you have a hidden INPUT with the name `_charset_`, IE will fill in that field with the character set it used when submitting the form. See also related question http://stackoverflow.com/questions/12830546/accept-charset-utf-8-parameter-doesnt-do-anything-when-used-in-form – EricLaw Jul 29 '13 at 16:30
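As a rough illustration of the `_charset_` convention described in the comment above, here is a minimal server-side sketch. It assumes the form includes a hidden input named `_charset_` and that the browser actually fills it in, which the comment only reports for IE. The function name `decode_with_charset_field`, the `ACCEPTED` allowlist and the UTF-8 fallback are hypothetical choices, not part of any standard.

```python
from urllib.parse import parse_qsl

# Hypothetical allowlist: only charsets this service is willing to decode.
ACCEPTED = {"utf-8", "iso-8859-1", "windows-1252"}


def decode_with_charset_field(body: bytes, fallback: str = "utf-8"):
    """Decode a form-encoded body using its `_charset_` field, if present."""
    # A well-formed urlencoded body is pure ASCII (%-escapes included).
    text = body.decode("ascii")

    # First pass: ISO-8859-1 maps every byte to a character, so this pass
    # cannot fail and is only used to read the ASCII value of `_charset_`.
    first_pass = parse_qsl(text, keep_blank_values=True, encoding="iso-8859-1")
    declared = dict(first_pass).get("_charset_", "").strip().lower()

    # Assumption: fall back silently when the field is missing or unknown.
    charset = declared if declared in ACCEPTED else fallback

    # Second pass: decode the %-escaped octets with the chosen charset.
    fields = parse_qsl(text, keep_blank_values=True,
                       encoding=charset, errors="replace")
    return charset, fields


charset, fields = decode_with_charset_field(b"_charset_=UTF-8&name=caf%C3%A9")
print(charset, fields)
```

The allowlist is there because the field is client-controlled; passing an arbitrary name straight to a decoder would let the client pick any codec the runtime knows about.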

3 Answers

72

The default encoding of an HTTP POST is ISO-8859-1 (RFC 2616 makes that the default for text media types sent without a charset parameter).

Otherwise, you have to look at the Content-Type header, which will then look like:

Content-Type: application/x-www-form-urlencoded; charset=UTF-8

You may be able to declare your form with

<form enctype="application/x-www-form-urlencoded;charset=UTF-8">

or

<form accept-charset="UTF-8">

to force the encoding.
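On the receiving side, the header check described above could look roughly like this. It is only a sketch: the helper names `charset_from_content_type` and `decode_form_body` are made up for illustration, and the ISO-8859-1 fallback simply mirrors the default stated in this answer, which may not match what a given browser actually sends.

```python
from urllib.parse import parse_qsl


def charset_from_content_type(content_type: str,
                              default: str = "ISO-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    _media_type, *params = content_type.split(";")
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # The parameter value may be quoted: charset="UTF-8"
            return value.strip().strip('"')
    return default


def decode_form_body(body: bytes, content_type: str):
    """Decode an application/x-www-form-urlencoded body into (name, value) pairs."""
    charset = charset_from_content_type(content_type)
    # A well-formed urlencoded body is pure ASCII; the charset only matters
    # when the %-escaped octets are turned back into characters.
    return parse_qsl(body.decode("ascii"), keep_blank_values=True,
                     encoding=charset, errors="replace")


print(decode_form_body(b"name=caf%C3%A9",
                       "application/x-www-form-urlencoded; charset=UTF-8"))
print(decode_form_body(b"name=caf%E9",
                       "application/x-www-form-urlencoded"))
```

Both calls print `[('name', 'café')]`: the first because the header declares UTF-8, the second because the fallback happens to match the bytes that were sent.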

Some references:

http://www.htmlhelp.com/reference/html40/forms/form.html

http://www.w3schools.com/tags/tag_form.asp

chburd
  • Well, I don't know; I'm not a web developer. I've added links where you can find some references. – chburd Apr 02 '09 at 11:56
  • I tested the default form encoding on Safari and Firefox a few years ago, and found that they always returned UTF-8. Didn't test on IE. I should add that the page with the form was in UTF-8. – David Leppik Jan 17 '11 at 18:04
  • 1
    I should also add that this appears to be in violation of the HTTP standard (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4.1 ). I'm using Tomcat, which claims that the client did not specify a charset in its headers. (I trust Tomcat, but couldn't verify that it was in fact reading the headers properly.) – David Leppik Jan 17 '11 at 18:14
  • 1
    One other thing: HTML4's default format for forms is 'UNKNOWN', i.e. use the page's format. The issue here is browsers that then refuse to specify the charset in the POST. (See http://www.w3.org/TR/html4/interact/forms.html#h-17.3 ) – David Leppik Jan 17 '11 at 18:15
  • 1
    @chburd: This doesn't work on Firefox, by the way. It just ignores the charset from the enctype attribute, and posts in what it wants (UTF-8 it seems). – Pawel Veselov Feb 28 '12 at 18:48
  • @PawelVeselov When you say "ignores it" and uses UTF-8, do you mean that it sends using `Content-Type: application/x-www-form-urlencoded; charset=UTF-8` (ok) or that it sends without a `charset` in the Content-Type (which would be evil)? – Michael Anderson Jul 18 '12 at 08:54
  • I deleted the test jsp ;( To my recollection, it was doing the evil thing. – Pawel Veselov Jul 19 '12 at 06:47
  • @PawelVeselov It seems that Chrome does the evil thing :( . Will check a few other browsers later. – Michael Anderson Jul 23 '12 at 03:26
  • I had been trying to reproduce this bug for the last 2 days; your answer saved me time. Thanks. – Mubashar Jan 24 '13 at 00:24
  • There is actually no way to _enforce_ a specific charset for the server. If the server doesn't understand a given charset (and there are many out there!), it obviously cannot interpret the url-encoded bytestream correctly. I was having trouble with a buggy REST interface (Adobe CQ5) where I actually had to send a doubly URL-encoded, UTF-8-encoded string while it ignored the provided charset info. In short: you never know what's happening to your bytes... – Swen Vermeul Sep 29 '13 at 23:04
  • accept-charset="UTF-8" was enough to make it work in a Struts 2 application with commons-fileupload. – Alfredo Osorio Dec 16 '14 at 16:44
10

The charset used in the POST will match the charset of the HTML page hosting the form. Hence, if the page containing your form is served as UTF-8, that is the encoding used for the posted content. The URL encoding is applied after the values are converted to the sequence of octets for that character encoding.
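To make the ordering concrete (characters to octets first, then %-escaping of those octets), here is a small sketch using Python's standard `urllib.parse`. The sample value "café" is just an illustration.

```python
from urllib.parse import quote_plus, unquote_plus

value = "café"

# Step 1 (characters -> octets) depends on the charset;
# step 2 (%-escaping the octets) does not.
utf8_form = quote_plus(value, encoding="utf-8")         # 'caf%C3%A9'
latin1_form = quote_plus(value, encoding="iso-8859-1")  # 'caf%E9'

# Decoding with the wrong charset does not raise; it silently
# produces the wrong characters.
misread = unquote_plus(utf8_form, encoding="iso-8859-1")  # 'cafÃ©'

print(utf8_form, latin1_form, misread)
```

The same field value therefore arrives as different bytes depending on the page's charset, and nothing in the escaped body itself says which one was used.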

AnthonyWJones
  • 1
    I was more wondering if there was a stateless way of approaching it, as in without knowledge of the form's character set. – Ciaran McNulty Apr 02 '09 at 10:01
  • No. The client would have to explicitly declare the charset in the HTTP headers for that to work. – Remy Lebeau Aug 05 '10 at 20:06
  • 3
    @CiaranMcNulty that's actually not true, some browsers don't do it. I tried this on FF, forcing the page charset to iso-8859-1, and it still submitted the form in UTF-8 – Pawel Veselov Feb 28 '12 at 18:45
1

Try setting the charset on your Content-Type:

httpCon.setRequestProperty( "Content-Type", "multipart/form-data; charset=UTF-8; boundary=" + boundary );
ZeroConcept