What is the best / cleanest way of determining the size of a Java String after conversion into a byte array, preferably without first converting it into the array?

EDIT: By size I mean the length of the byte array that would be returned by the `getBytes` method.

Magik6k

1 Answer

There are encodings where you can know the length in bytes: all those which use a fixed number of bytes per character. Examples are US-ASCII and ISO-8859-1 (one byte per character) and UTF-16 (two bytes per Java `char`, that is, per UTF-16 code unit).
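As a rough illustration of the fixed-width case, the length can be computed from the string itself. This is only a sketch: the class and method names are made up for this example, and the single-byte variant assumes that `getBytes` replaces each unmappable character (a surrogate pair counting once) with one `?` byte. Note that it uses `UTF_16BE`; Java's plain `UTF-16` charset may also write a byte-order mark, which would change the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class FixedWidthLengths {
    private FixedWidthLengths() {}

    /** Expected length of s.getBytes(StandardCharsets.UTF_16BE): two bytes per char. */
    static int utf16beLength(String s) {
        return 2 * s.length();
    }

    /**
     * Expected length of s.getBytes(...) for US-ASCII or ISO-8859-1.
     * Assumption: unmappable input becomes a single '?' byte, and a
     * surrogate pair is replaced once, not twice.
     */
    static int singleByteLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        // Print the computed length next to the real one for comparison.
        System.out.println(utf16beLength(s) + " vs "
                + s.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println(singleByteLength(s) + " vs "
                + s.getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```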

When using a variable-length encoding, like the popular UTF-8, you cannot know the length of the byte array a priori from the number of characters alone; it depends on which characters the string contains.
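For UTF-8 the length can still be counted by walking the string once, without allocating the array. The following is a minimal sketch, not a definitive implementation: `Utf8Length` and `utf8Length` are hypothetical names, and the handling of unpaired surrogates assumes the `getBytes` behaviour of substituting a single `?` byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class Utf8Length {
    private Utf8Length() {}

    /** Counts the bytes that s.getBytes(StandardCharsets.UTF_8) would produce. */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII range
            } else if (c < 0x800) {
                bytes += 2;                       // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // valid surrogate pair, one code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // assumption: unpaired surrogate becomes '?'
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        // Print the counted length next to the real one for comparison.
        System.out.println(utf8Length(s) + " vs "
                + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Whether this is worth it over simply calling `getBytes(...).length` depends on how large the strings are and how often the length is needed.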

David Rabinowitz
  • But it isn't so easy to know how many characters are in a string (as opposed to how many chars are in a string, which is trivial). Consider high and low surrogates, as in a string which contains, say, Egyptian hieroglyphics. Also, "a priori". – David Conrad Jan 08 '15 at 19:56
  • I think you can from the String.length() javadoc: "Returns the length of this string. The length is equal to the number of Unicode code units in the string." and "Returns: the length of the sequence of characters represented by this object." – David Rabinowitz Jan 08 '15 at 20:01
  • That's really only true for the BMP, the Basic Multilingual Plane of Unicode. Java `Strings` are stored in UTF-16BE format, so the character U+13000 is stored as the surrogate pair U+D80C, U+DC00, and `String.length()` just returns the length of the underlying `char[]`. – David Conrad Jan 08 '15 at 20:06
  • `new StringBuilder().appendCodePoint(0x13000).toString().length()` – David Conrad Jan 08 '15 at 20:08
  • It's really unfortunate, there are also bugs with `equalsIgnoreCase`, although there aren't many scripts outside the BMP that actually have cased letters. (The Deseret script is the only one I know of.) – David Conrad Jan 08 '15 at 20:09
  • However, I think that for most languages, String.length() returns the correct answer in characters. Hieroglyphics, while an interesting case, are really an edge case that I think 99% of the software in the world won't encounter (and the same goes for most Unicode characters above 2^16) – David Rabinowitz Jan 08 '15 at 20:12
  • Yes, it will work for much more than 99% of cases. But it will break for *all* Unicode characters over 2^16. I just picked hieroglyphs because they're outside the BMP. It's being over 64k that breaks it. Actually converting to a byte[] and checking the length works 100% of the time. – David Conrad Jan 08 '15 at 20:14
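The difference discussed in the comments can be seen with the U+13000 example. This is a small sketch with a made-up class name (`SurrogateDemo`); it only uses standard `String` and `StringBuilder` methods.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the U+13000 example from the comments.
final class SurrogateDemo {
    private SurrogateDemo() {}

    /** Returns { char count, code point count, UTF-8 byte count } for s. */
    static int[] measure(String s) {
        return new int[] {
            s.length(),                               // UTF-16 code units
            s.codePointCount(0, s.length()),          // Unicode code points
            s.getBytes(StandardCharsets.UTF_8).length // actual encoded size
        };
    }

    public static void main(String[] args) {
        String s = new StringBuilder().appendCodePoint(0x13000).toString();
        int[] m = measure(s);
        // Prints: chars=2 codePoints=1 utf8Bytes=4
        System.out.println("chars=" + m[0] + " codePoints=" + m[1] + " utf8Bytes=" + m[2]);
    }
}
```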