Java: get length of string when represented as bytes of some encoding

Question

What is the best / cleanest way to determine the size of a Java String after conversion into a byte array, preferably without first converting it into the array?

EDIT: By size I mean the length of the byte array that the `getBytes` method would return.

Wait, do you want the size of the `String` or the size of the `byte[]`? — Sotirios Delimanolis, Jan 08 '15 at 19:50
"Size" needs some more detail. Do you mean predicting the array size of the byte array? — Gus, Jan 08 '15 at 19:51
@christopher I'm given a few different encodings (Charset instances) — Magik6k, Jan 08 '15 at 19:51
You would have to redo the work the charset is going to do in converting. In some cases it would be straightforward, but there's always the chance your implementation would have bugs. I'm not sure if this is a good idea. — David Conrad, Jan 08 '15 at 19:54
There is "AverageBytesPerChar", which may be of some use: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html#averageBytesPerChar() — Gus, Jan 08 '15 at 19:54
Did you see this: http://stackoverflow.com/questions/229886/size-of-a-byte-in-memory-java — Jeff Anderson, Jan 08 '15 at 19:55
Oh, and MaxBytesPerChar, which is what you'll have to support anyway — Gus, Jan 08 '15 at 19:55
@user2348184 I'm aware of Java memory alignment, heap memory-saving is not the case here — Magik6k, Jan 08 '15 at 19:59
There is a relatively simple algorithm which you can use to examine each byte of a UTF8 string in turn and determine which are 1, 2, and 4-byte codes. I coded it several times when working on the IBM iSeries JVM. I just don't remember what it is. — Hot Licks, Jan 08 '15 at 20:02
@HotLicks Was it something like [this](http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html)? — ajb, Jan 08 '15 at 20:08
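
The `averageBytesPerChar` and `maxBytesPerChar` values mentioned in the comments above give an estimate and an upper bound, not the exact size. A minimal sketch of the upper bound, assuming a charset that supports encoding; the class `EncodedSizeBound` and the helper `maxEncodedLength` are hypothetical names used only for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncodedSizeBound {
    // Hypothetical helper: worst-case number of bytes needed to encode s.
    // This is an upper bound, not the exact length of s.getBytes(charset).
    static long maxEncodedLength(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return (long) Math.ceil((double) s.length() * encoder.maxBytesPerChar());
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo"; // "héllo", written with an escape to avoid source-encoding issues
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": bound=" + maxEncodedLength(s, cs)
                    + ", actual=" + s.getBytes(cs).length);
        }
    }
}
```

The bound is useful for sizing a buffer in advance; it does not answer the question of the exact length.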

score 1 · Accepted Answer · answered Jan 08 '15 at 19:54

There are encodings where you can know the length in bytes - all those which use fixed byte length per character. Examples are US-ASCII, ISO-8859-1 and UTF-16.
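
A small sketch of the fixed-width case, with two caveats that are assumptions of this example and not claims from the answer: the width is fixed per Java `char` (UTF-16 code unit), not per Unicode character, and `StandardCharsets.UTF_16` prepends a two-byte byte-order mark when encoding, whereas `UTF_16BE` does not. The class `FixedWidthLengths` and helper `utf16beLength` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class FixedWidthLengths {
    // Hypothetical helper: predicted size in UTF-16BE (or UTF-16LE),
    // two bytes per Java char, no byte-order mark.
    static long utf16beLength(String s) {
        return 2L * s.length();
    }

    public static void main(String[] args) {
        String s = "na\u00EFve"; // "naïve": 5 chars, all in the BMP

        System.out.println(s.length());                                    // 5
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 5
        System.out.println(utf16beLength(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);     // 12 (2-byte BOM + 10)
    }
}
```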

With a variable-length encoding, such as the popular UTF-8, you cannot know the length of the byte array a priori.
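
For UTF-8 specifically, the length can still be computed without allocating the array by walking the string once and adding up the width of each code point. A minimal sketch, assuming well-formed input for the common path; the class `Utf8Length` and helper `utf8Length` are hypothetical, and the handling of unpaired surrogates assumes `getBytes` substitutes the encoder's default one-byte replacement:

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Hypothetical helper: number of bytes s.getBytes(UTF_8) is expected to produce.
    static long utf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // U+0000..U+007F
            } else if (c < 0x800) {
                bytes += 2;                       // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair: one code point above U+FFFF
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // assumption: unpaired surrogate becomes '?'
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00E9\u20AC" + new StringBuilder().appendCodePoint(0x13000);
        System.out.println(utf8Length(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 10
    }
}
```

This only covers UTF-8; for an arbitrary `Charset`, calling `getBytes` and reading the array length remains the approach that needs no per-encoding logic.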

David Rabinowitz

But it isn't so easy to know how many characters are in a string (as opposed to how many chars are in a string, which is trivial). Consider high and low surrogates, as in a string which contains, say, Egyptian hieroglyphics. Also, "a priori". – David Conrad Jan 08 '15 at 19:56
I think you can from the String.length() javadoc: "Returns the length of this string. The length is equal to the number of Unicode code units in the string." and "Returns: the length of the sequence of characters represented by this object." – David Rabinowitz Jan 08 '15 at 20:01
That's really only true for the BMP, the Basic Multilingual Plane of Unicode. Java `Strings` are stored in UTF-16BE format, so the character U+13000 is stored as the surrogate pair U+D80C, U+DC00, and `String.length()` just returns the length of the underlying `char[]`. – David Conrad Jan 08 '15 at 20:06
`new StringBuilder().appendCodePoint(0x13000).toString().length()` – David Conrad Jan 08 '15 at 20:08
It's really unfortunate, there are also bugs with `equalsIgnoreCase`, although there aren't many scripts outside the BMP that actually have cased letters. (The Deseret script is the only one I know of.) – David Conrad Jan 08 '15 at 20:09
However, I think that for most languages, String.length() returns the correct answer in characters. Hieroglyphics, while an interesting case, are really an edge case that I think 99% of the software in the world won't encounter (and the same goes for most Unicode characters above 2^16) – David Rabinowitz Jan 08 '15 at 20:12
Yes, it will work for much more than 99% of cases. But it will break for *all* Unicode characters over 2^16. I just picked hieroglyphs because they're outside the BMP. It's being over 64k that breaks it. Actually converting to a byte[] and checking the length works 100% of the time. – David Conrad Jan 08 '15 at 20:14
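
The distinction discussed in these comments can be checked directly. A short sketch using the code point U+13000 from the thread; the class `SurrogateDemo` and helper `hieroglyph` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Hypothetical helper: a one-character string outside the BMP (U+13000).
    static String hieroglyph() {
        return new StringBuilder().appendCodePoint(0x13000).toString();
    }

    public static void main(String[] args) {
        String s = hieroglyph();

        System.out.println(s.length());                                   // 2 (chars, i.e. UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));              // 1 (Unicode code points)
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```

So `length()` counts `char`s, `codePointCount` counts characters in the Unicode sense, and the byte length depends on the encoding.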
