I'm implementing a REST service that receives POST requests.

The encoding in my system is UTF-8.

I'm using JBoss 5, whose servlet container follows the HTTP/1.1 specification in RFC 2068, which states:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

So when a client that invokes my service encodes the body as, for example, UTF-8 but doesn't specify a charset, and the POST body contains characters outside US-ASCII, the JBoss servlet assumes ISO-8859-1 and decodes the bytes incorrectly, so my system receives "broken" characters. For example, instead of the string "día" I receive "dÂa".
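To make the failure mode concrete, here is a minimal sketch that reproduces it outside the container. The class and method names are hypothetical, and it assumes Java 7+ for `StandardCharsets`. It encodes "día" as UTF-8, decodes the bytes as ISO-8859-1 the way the container would, and then reverses the damage.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JBoss or the servlet API.
final class CharsetMismatchDemo {

    // Encode as UTF-8 (what the client sent), then decode as ISO-8859-1
    // (what the container assumes when no charset is declared).
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage. Only valid when the bytes really were UTF-8
    // and were decoded as ISO-8859-1, which maps every byte losslessly.
    static String repair(String mangled) {
        byte[] wire = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "d\u00EDa"; // "día", escaped to avoid source-encoding issues
        String mangled = decodeUtf8BytesAsLatin1(original);
        System.out.println(mangled.length());                 // one char per UTF-8 byte
        System.out.println(repair(mangled).equals(original)); // round trip restores the text
    }
}
```

The "í" becomes the two bytes 0xC3 0xAD in UTF-8, which ISO-8859-1 decodes as "Ã" followed by a soft hyphen (U+00AD) that is usually invisible, so how the broken string looks depends on where it is displayed.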

The approach I found for "protecting" my system is to require the client to specify the charset in the Content-Type header. If no charset is specified, I respond with an HTTP 403 and a message indicating that "the charset value must be specified".
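The check described above could be sketched roughly as follows. The class and method names are hypothetical, and the parsing is deliberately simple: it splits on `;` and so does not handle quoted parameter values that themselves contain a semicolon.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper; a sketch of the "charset is mandatory" rule.
final class CharsetParamCheck {

    // Returns the value of the charset parameter in a Content-Type
    // header value, or null when the parameter is absent or empty.
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    // True only when a charset is declared and this JVM can decode it.
    static boolean hasSupportedCharset(String contentType) {
        String name = extractCharset(contentType);
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }
}
```

Inside a servlet, `request.getContentType()` would supply the header value, and `request.getCharacterEncoding()` returning `null` is another way to detect a missing charset. When `hasSupportedCharset` returns `false`, the service would send its error response before reading the body.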

Is there anything wrong with this approach?

Gustavo Fava
  • What if the client sends `Content-Type: text/plain; charset=ISO-8859-1` with a `UTF-8` body? – xiaofeng.li Jul 06 '16 at 02:31
  • @Luke I think in that case you have to take a more defensive approach than mine, perhaps by examining the body. You can take a look at [this post](https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream). – Gustavo Fava Jul 07 '16 at 12:27
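
One way to "examine the body", sketched here with hypothetical names, is to decode the raw bytes strictly with the declared charset and reject the request if decoding fails. This is only a partial defence: it catches a body that is invalid for the declared charset, but ISO-8859-1 accepts every byte sequence, so the exact case raised in the comment (declared ISO-8859-1, UTF-8 body) passes this check undetected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; a sketch of strict decoding, not a charset detector.
final class StrictBodyDecoder {

    // Decodes the body with the declared charset, throwing instead of
    // silently substituting replacement characters on bad input.
    static String decodeStrict(byte[] body, Charset declared)
            throws CharacterCodingException {
        return declared.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(body))
                .toString();
    }

    // True when the bytes form a valid sequence in the declared charset.
    static boolean isValidFor(byte[] body, Charset declared) {
        try {
            decodeStrict(body, declared);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the ISO-8859-1 bytes of "día" (0x64 0xED 0x61) are rejected when the declared charset is UTF-8, because 0xED starts a multi-byte sequence that is never completed.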

1 Answer

RFC 2068 has been obsoleted twice and really is irrelevant. You need to look at RFC 7231, which doesn't define a default anymore. This means that the default is governed by the definition of the media type.

For text/plain, this implies US-ASCII (as far as I remember), so clients that want to send non-ASCII characters really need to specify the charset.

Julian Reschke
  • Yes, you are right. I took a look at the RFC 7231 that you mentioned, and [RFC 2046, section 4.1.2](https://tools.ietf.org/html/rfc2046#section-4.1.2) states that for text/plain "...The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII." – Gustavo Fava Jul 06 '16 at 18:06