
When I make a POST request with a JSON body to my REST service I include Content-type: application/json; charset=utf-8 in the message header. Without this header, I get an error from the service. I can also successfully use Content-type: application/json without the ;charset=utf-8 portion.

What exactly does charset=utf-8 do? I know it specifies the character encoding, but the service works fine without it. Does this encoding limit the characters that can be in the message body?
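
For reference, such a request might look roughly like this (a sketch using Python's requests; the endpoint URL and payload are placeholders):

    import requests

    # Hypothetical endpoint and payload, shown only to illustrate the header in question.
    body = '{"name": "café"}'.encode("utf-8")
    response = requests.post(
        "https://example.com/api/items",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    print(response.status_code)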

DenaliHardtail
  • take a look at http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx – Daniel Powell Feb 13 '12 at 02:58
  • Intriguingly, according to [IANA's `application/json` Media Type Registration](http://www.iana.org/assignments/media-types/application/json), there doesn't appear to be a supported `charset` parameter at all, albeit often being supplied in practice. – Uux Nov 12 '14 at 09:42
  • `I know it specifies the character encoding but the service works fine without it.` "working" does not always mean "the existent code/configuration is the most correct way covering all the corner cases to do one thing". It depends on all the conventions and assumptions which may not work under other circumstances. For me personally, I always try to be as explicit as possible. – WesternGun Apr 15 '19 at 13:38
  • Sending a "charset" parameter is incorrect and meaningless. See RFC 8259, Section 11, last sentence. – Julian Reschke Apr 17 '19 at 04:42
  • JSON **must** be encoded by UTF-8, and there is **no** "charset" parameter. See [this brief quote](https://stackoverflow.com/a/73074619/10027592) or have a look at [RFC8259](https://www.rfc-editor.org/rfc/rfc8259). – starriet Jul 22 '22 at 03:00

7 Answers


The header just denotes what the content is encoded in. It is not necessarily possible to deduce the type of the content from the content itself, i.e. you can't necessarily just look at the content and know what to do with it. That's what HTTP headers are for: they tell the recipient what kind of content they're (supposedly) dealing with.

Content-type: application/json; charset=utf-8 designates the content to be in JSON format, encoded in the UTF-8 character encoding. Designating the encoding is somewhat redundant for JSON, since the default (only?) encoding for JSON is UTF-8. So in this case the receiving server is apparently happy knowing that it's dealing with JSON and assumes that the encoding is UTF-8 by default, which is why it works with or without the header.
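
As a rough sketch of what such a forgiving receiver does, you can parse the charset parameter out of the Content-Type value and fall back to UTF-8 when it is absent (Python here is purely illustrative, and the helper name is made up):

    from email.message import Message

    def charset_of(content_type: str, default: str = "utf-8") -> str:
        """Extract the charset parameter from a Content-Type value,
        falling back to UTF-8 when none is given."""
        msg = Message()
        msg["Content-Type"] = content_type
        return msg.get_param("charset", failobj=default)

    print(charset_of("application/json; charset=utf-8"))  # utf-8
    print(charset_of("application/json"))                 # utf-8 (the fallback)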

Does this encoding limit the characters that can be in the message body?

No. You can send anything you want in the header and the body. But if the two don't match, you may get wrong results. If you specify in the header that the content is UTF-8 encoded but you're actually sending Latin1 encoded content, the receiver may produce garbage data, trying to interpret Latin1 encoded data as UTF-8. If, of course, you specify that you're sending Latin1 encoded data and you're actually doing so, then yes, you're limited to the 256 characters you can encode in Latin1.
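
A quick sketch of that failure mode (Python, purely illustrative): the body bytes are Latin1, but the receiver trusts a header that claims UTF-8.

    # Body encoded as Latin1, while the header (hypothetically) claims charset=utf-8.
    body = '{"city": "Zürich"}'.encode("latin-1")

    try:
        body.decode("utf-8")                       # a strict receiver rejects the body
    except UnicodeDecodeError as err:
        print("strict UTF-8 decode fails:", err)

    print(body.decode("utf-8", errors="replace"))  # a lenient one produces {"city": "Z�rich"}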

deceze
  • Of course, in JSON you could still represent non-Latin1 characters using escape sequences like `\u20AC`. – dan04 Feb 13 '12 at 13:28
  • According to the standard for json, you are not actually allowed to use latin1 for the encoding of the contents. JSON content must be encoded as unicode, be it UTF-8, UTF-16, or UTF-32 (big or little endian). – Daniel Luna Sep 27 '13 at 14:23
  • There is no charset parameter on application/json. – Julian Reschke Nov 06 '13 at 15:17
  • @DanielLuna is right, `application/json` has to be in one of the ucs transformation formats. Also, since the first four bytes of JSON are limited, you can always tell if it's 8, 16, or 32 *and* its endian-ness. – Jason Coco May 15 '14 at 05:20
  • Even if it is redundant you might want to include `charset=utf-8` for security reasons: https://github.com/shieldfy/API-Security-Checklist/issues/25 – manuc66 Jul 14 '17 at 13:43
  • Actually using Node.js, trying to include a `charset=utf-8` along `application/json` breaks everything. – Alexis Wilke Feb 08 '19 at 06:55
  • Returning `text/html; charset=utf-8` instead of `text/html` solved the issue I was experiencing with the browser not displaying non-ASCII letters when talking to my custom webserver implementation – Kresten Feb 10 '23 at 11:48

To substantiate @deceze's claim that the default JSON encoding is UTF-8...

From IETF RFC4627:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

      00 00 00 xx  UTF-32BE
      00 xx 00 xx  UTF-16BE
      xx 00 00 00  UTF-32LE
      xx 00 xx 00  UTF-16LE
      xx xx xx xx  UTF-8
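
A minimal sketch of that detection rule (Python, purely illustrative; the function name is made up, and real parsers do this internally):

    def detect_json_encoding(octets: bytes) -> str:
        """Guess the Unicode encoding of a JSON text from the null-byte
        pattern of its first four octets, per the RFC 4627 table above."""
        b = octets[:4]
        if len(b) < 4:
            return "utf-8"                                   # too short to apply the table
        if b[0] == 0 and b[1] == 0 and b[2] == 0:
            return "utf-32-be"
        if b[0] == 0 and b[2] == 0:
            return "utf-16-be"
        if b[1] == 0 and b[2] == 0 and b[3] == 0:
            return "utf-32-le"
        if b[1] == 0 and b[3] == 0:
            return "utf-16-le"
        return "utf-8"

    print(detect_json_encoding('{"a": 1}'.encode("utf-16-le")))  # utf-16-le
    print(detect_json_encoding('{"a": 1}'.encode("utf-8")))      # utf-8
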
Drew Noakes
  • It always helps to think about JSON as binary format, not text format. – Sulthan Jan 12 '15 at 09:48
  • Now that RFC4627 has been obsoleted by RFC7159, which states that the root value may be a string (in explicit contrast to the former spec), how is this now implemented? The spec is vague in this regard, and just says that three encodings are allowed, but not how one is supposed to differentiate them. – Fabio Beltramini Oct 22 '15 at 20:34
  • @FabioBeltramini The above should still hold, because a string in JSON will not contain any literal null characters (nulls in JSON would need to be encoded with a numerical escape sequence ie `"\u0000"`). – thomasrutter Oct 28 '15 at 02:00
  • Actually the second character in UTF-16xx may not have a NULL in that case, but it will still be possible to determine encoding from the other bytes: `xx 00 00 00` is still UTF-32LE and `xx 00 xx xx` is still UTF-16LE, `00 xx xx xx` is still UTF-16BE. – thomasrutter Oct 28 '15 at 02:07

Note that IETF RFC 4627 has been superseded by RFC 7159 (and later by RFC 8259). Section 8.1 drops the encoding-detection text cited by @Drew earlier and adds:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
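
For illustration, Python's json module behaves along those lines: it rejects a BOM in an already-decoded string but ignores one when handed raw bytes. This is only an example of one parser's behavior, not something the RFC mandates for this particular API:

    import json

    with_bom = "\ufeff" + '{"ok": true}'

    try:
        json.loads(with_bom)                      # BOM in a decoded string is an error
    except json.JSONDecodeError as err:
        print("rejected:", err.msg)

    print(json.loads(with_bom.encode("utf-8")))   # BOM in raw bytes is ignored -> {'ok': True}
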
Alex
  • The assumption still holds though, as any valid json will still start with two ascii characters. – Larsing Dec 05 '17 at 13:29
  • One character, because a single numeral is a valid JSON file – Nayuki Oct 28 '19 at 03:29
  • [RFC8259](https://datatracker.ietf.org/doc/html/rfc8259#section-8.1): Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. – Andrey Sep 27 '21 at 19:03

JSON must be encoded by UTF-8, and there is no "charset" parameter.

RFC 8259:

11. IANA Considerations

The media type for JSON text is application/json.
...
Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.

Also,

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

(emphasis mine)

starriet

Dart's http implementation uses that "charset=utf-8" when decoding the response bytes, so I'm sure several implementations out there support it, to avoid the "latin-1" fallback charset when reading the bytes from the response. In my case, the response body string came back garbled, so I had to either decode the bytes as UTF-8 manually or add that "inner" charset parameter to my server's API response header.

roipeker

I was using HttpClient and getting back a response with a content-type of application/json, and I lost characters such as foreign-language letters and symbols that use Unicode, since HttpClient defaults to ISO-8859-1. So be as explicit as possible, as mentioned by @WesternGun, to avoid any possible problems.

There was no way to handle it from the request side, because the server ignored the requested charset (method.setRequestHeader("accept-charset", "UTF-8"); did nothing for me), so I had to retrieve the response data as raw bytes and convert it into a String using UTF-8. So it is recommended to be explicit and avoid assumptions about default values.
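
The same workaround applies to any client that guesses the charset: take the raw response bytes and decode them as UTF-8 yourself. A minimal sketch with Python's requests (placeholder URL; the answer above used Java's HttpClient, but the idea is the same):

    import requests

    resp = requests.get("https://example.com/api/greeting")   # placeholder endpoint

    # Some clients fall back to ISO-8859-1 (or guess) when the response's
    # Content-Type carries no charset; decoding the raw bytes explicitly
    # avoids relying on that default.
    text = resp.content.decode("utf-8")
    print(text)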

Tri Nguyen

I fully agree with @deceze, but I want to expand on the "I get an error from the service" part of the question.

Errors like this typically come back as HTTP 415:

HTTP 415 Unsupported Media Type

The HTTP 415 Unsupported Media Type client error response code indicates that the server refuses to accept the request because the payload format is in an unsupported format.

The format problem might be due to the request's indicated Content-Type or Content-Encoding, or as a result of inspecting the data directly.

In other words, that is exactly what is happening in this case.

  • We have to send the correct content type and accept the right content type: add Content-Type: application/json and Accept: application/json, as in the sketch below. Otherwise, the server will assume its default.
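
A minimal sketch of sending both headers explicitly (Python's requests; the endpoint and payload are placeholders):

    import requests

    resp = requests.post(
        "https://example.com/api/items",            # placeholder endpoint
        json={"name": "example"},                   # serialized as the JSON body
        headers={
            "Content-Type": "application/json",     # what we are sending
            "Accept": "application/json",           # what we expect back
        },
    )
    print(resp.status_code)                         # should no longer be 415
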
Hamit YILDIRIM