I want to determine the size in bytes of a JSON document held in a Java String. The calculation should be platform-independent, because the software runs on different systems (Windows, Linux, z/OS, ...) with possibly different default character encodings. The JSON is supposed to contain only characters that can be encoded in UTF-8. So far, in all use cases, every character has been encodable in 1 byte; in the future, however, Chinese characters such as 𠼮 (U+20F2E) will be used, too.

Is there a best-practice way of calculating the data size robustly here?

From what I understand, json.getBytes("UTF-8").length seems to be a valid solution.

Test outputs on Windows:

This is a 1-byte UTF-8 character:

@
"@".length() -> 1
"@".getBytes().length -> 1
"@".getBytes("UTF-8").length -> 1
new String("@".getBytes("UTF-8")) -> @
"@".getBytes("UTF-16").length -> 4
new String("@".getBytes("UTF-16")) -> ��

This is a 2-byte UTF-8 character:

µ
"µ".length() -> 1
"µ".getBytes().length -> 2
"µ".getBytes("UTF-8").length -> 2
new String("µ".getBytes("UTF-8")) -> µ
"µ".getBytes("UTF-16").length -> 4
new String("µ".getBytes("UTF-16")) -> ��

This is a 4-byte UTF-8 character:

𠼮
"𠼮".length() -> 2
"𠼮".getBytes().length -> 4
"𠼮".getBytes("UTF-8").length -> 4
new String("𠼮".getBytes("UTF-8")) -> 𠼮
"𠼮".getBytes("UTF-16").length -> 6
new String("𠼮".getBytes("UTF-16")) -> ���c��

EDIT: The length of the "compressed" JSON should be calculated, i.e. without any unnecessary whitespace (from pretty printing).

eSKape
  • Possible duplicate of [What is the Java's internal represention for String? Modified UTF-8? UTF-16?](https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16) – Jiri Tousek Mar 12 '18 at 09:30
  • JSON allows multiple representations of a character and insignificant whitespace, so what's the meaning of the length of a JSON document? – Tom Blodget Mar 12 '18 at 16:45
  • � is a marker to users that programmers have mishandled their text and lost some of it. – Tom Blodget Mar 12 '18 at 16:51
  • @TomBlodget: in this case we want the length of a compressed JSON, i.e. without all unnecessary whitespaces – eSKape Mar 13 '18 at 08:24

1 Answer

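If you have your JSON available as a String with all insignificant whitespace removed, String.getBytes(String charsetName).length with an explicit charset name such as "UTF-8" should give you the correct size, independent of the platform's default encoding.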
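
A minimal sketch of that idea, not a definitive implementation: the class name JsonSize and both helper methods are made up for illustration. It uses the getBytes(Charset) overload with StandardCharsets.UTF_8, which avoids the checked UnsupportedEncodingException of the String-name overload. The whitespace stripper is a hand-rolled assumption that the input is already valid JSON; a real JSON library would be the safer choice for producing the compact form.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class JsonSize {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Drops whitespace that appears outside of string literals.
     * Assumes the input is syntactically valid JSON; nothing is validated.
     */
    static String stripInsignificantWhitespace(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false; // currently inside a "..." literal
        boolean escaped = false;  // previous char inside the literal was a backslash
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD843\uDF2E is the surrogate pair for U+20F2E, \u00B5 is the micro sign.
        String pretty = "{\n  \"sign\": \"\uD843\uDF2E\",\n  \"unit\": \"\u00B5 m\"\n}";
        String compact = stripInsignificantWhitespace(pretty);
        System.out.println(compact);
        System.out.println(utf8Length(compact) + " bytes in UTF-8");
    }
}
```

The space inside "µ m" survives because it is part of a string value; only whitespace between tokens is removed.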

Note that in JVM memory a String is encoded in UTF-16, and once it is written to a file or a database it may use a different encoding (UTF-8, ISO 8859-1, ...) and so have a different size.
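
To illustrate how the size depends on the target encoding, here is a small sketch; the class name EncodedSizes and the method encodedLength are hypothetical. Note that Java's "UTF-16" charset writes a 2-byte byte-order mark when encoding, which is why the question's test shows 4 bytes for a single BMP character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class EncodedSizes {

    /** Number of bytes the string occupies in the given charset. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String micro = "\u00B5"; // the micro sign from the question
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,   // encoder prepends a byte-order mark
            StandardCharsets.UTF_16BE  // no byte-order mark
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + " -> " + encodedLength(micro, cs));
        }
    }
}
```

Keep in mind that getBytes silently substitutes a replacement byte for characters the target charset cannot represent, so the length reported for a charset such as ISO 8859-1 is only meaningful if every character is mappable.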

Antoine Mottier