
Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:

"some really long string".getBytes("UTF-8").length

I need to calculate a length prefix for potentially long serialized messages.

Trevor Robinson
    If your problem is raw speed, and not memory, are you sure that an ad-hoc function would be faster than `getBytes` + `length`? The current JREs have quite fast conversion routines implemented in native code. – gioele Dec 19 '11 at 14:04
    I'm concerned about memory pressure too, but the main concern is that large allocations could cause more garbage collection overhead. Rather than introducing a potential performance issue (that needs to be verified by profiling, of course), I thought I'd ask if there was a more specific API available. BTW, Oracle's JRE does not use native code for this: they allocate a worst-case byte array (`maxBytesPerChar`) and use an array-based implementation of CharsetEncoder (see `sun.nio.cs.UTF_8`). – Trevor Robinson Dec 19 '11 at 23:29

4 Answers


Here's an implementation based on the UTF-8 specification:

public class Utf8LenCounter {
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)) {
        count += 4;
        ++i;
      } else {
        count += 3;
      }
    }
    return count;
  }
}

This implementation is not tolerant of malformed strings.
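To see what "not tolerant of malformed strings" means in practice, here is a small self-contained demo (the class and method names are illustrative; the counting logic is the same as above). An unpaired high surrogate makes the counter and a real encoder disagree: the counter assumes a valid low surrogate follows and reports 4 bytes, while `String.getBytes` substitutes `?` and produces a single byte.

```java
import java.nio.charset.StandardCharsets;

public class MalformedInputDemo {
    // same counting logic as Utf8LenCounter.length above
    static int utf8Length(CharSequence s) {
        int count = 0;
        for (int i = 0, len = s.length(); i < len; i++) {
            char ch = s.charAt(i);
            if (ch <= 0x7F) count++;
            else if (ch <= 0x7FF) count += 2;
            else if (Character.isHighSurrogate(ch)) { count += 4; ++i; }
            else count += 3;
        }
        return count;
    }

    public static void main(String[] args) {
        String malformed = "\uD800"; // unpaired high surrogate
        // counter assumes a surrogate pair and answers 4 ...
        System.out.println(utf8Length(malformed));
        // ... but the encoder replaces the bad input with '?' (1 byte)
        System.out.println(malformed.getBytes(StandardCharsets.UTF_8).length);
    }
}
```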

Here's a JUnit 4 test for verification:

public class LenCounterTest {
  @Test public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;
    public boolean hasNext() { return codepoint < MAX; }
    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }
    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
        if (!Character.isDefined(codepoint)) { continue; }
        return codepoint;
      }
      return MAX;
    }
  }
}

Please excuse the compact formatting.

McDowell
    That should work, but it's needlessly complicated: you don't need to support 5- and 6-byte characters (since Unicode doesn't allow, and UTF-16 can't represent, codepoints that high), and if `Character.isHighSurrogate(ch)`, then you don't actually need to determine the codepoint: the set of codepoints that require surrogate pairs in UTF-16 is the same as the set of codepoints that require four bytes in UTF-8. Therefore, if it's O.K. not to support invalid surrogate pairs, then you can just write – ruakh Dec 15 '11 at 03:12
    `if(ch <= '\x7F') ++count; else if(ch <= '\u07FF') count += 2; else if(Character.isHighSurrogate(ch)) { count += 4; ++i; } else count += 3;`. But +1 for including a super-comprehensive unit-test. :-) – ruakh Dec 15 '11 at 03:13
  • @ruakh - all good points; I've updated the answer with your implementation. – McDowell Dec 15 '11 at 09:24
  • @McDowell, I think there's an error in your unit test code. The current code excludes MAX, but MAX is a valid codepoint, isn't it? – 0xbe5077ed Apr 06 '16 at 04:38
    The test case could be even more compact when using a simple `for` loop instead of that `AllCodepointsIterator`. Further, there are already convenient constants in `java.lang.Character`: `for(int codepoint=Character.MIN_CODE_POINT; codepoint<=Character.MAX_CODE_POINT; codepoint++) { if(codepoint == Character.MIN_SURROGATE) codepoint=Character.MAX_SURROGATE+1; if(!Character.isDefined(codepoint)) continue; String test = new String(Character.toChars(codepoint)); Assert.assertEquals(test.getBytes(utf8).length, Utf8LenCounter.length(test)); }` – Holger Apr 24 '17 at 13:58

Using Guava's Utf8:

Utf8.encodedLength("some really long string")
Aaron Feldman

The best method I could come up with is to use CharsetEncoder to write repeatedly into the same temporary buffer:

public int getEncodedLength(CharBuffer src, CharsetEncoder encoder)
    throws CharacterCodingException
{
    // CharsetEncoder.flush fails if encode is not called with >0 chars
    if (!src.hasRemaining())
        return 0;

    // encode into a byte buffer that is repeatedly overwritten
    final ByteBuffer outputBuffer = ByteBuffer.allocate(1024);

    // encoding loop
    int bytes = 0;
    CoderResult status;
    do
    {
        status = encoder.encode(src, outputBuffer, true);
        if (status.isError())
            status.throwException();
        bytes += outputBuffer.position();

        outputBuffer.clear();
    }
    while (status.isOverflow());

    // flush any remaining buffered state
    status = encoder.flush(outputBuffer);
    if (status.isError() || status.isOverflow())
        status.throwException();
    bytes += outputBuffer.position();

    return bytes;
}

public int getUtf8Length(String str) throws CharacterCodingException
{
    return getEncodedLength(CharBuffer.wrap(str),
        Charset.forName("UTF-8").newEncoder());
}
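As a quick sanity check (a self-contained sketch that repeats the method above; the class name is illustrative), using a deliberately tiny 4-byte buffer forces several passes through the overflow loop and still yields the same count as `getBytes`:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EncodedLengthDemo {
    // same algorithm as getEncodedLength above, repeated so the demo compiles standalone
    static int getEncodedLength(CharBuffer src, CharsetEncoder encoder)
            throws CharacterCodingException {
        if (!src.hasRemaining())
            return 0;
        // deliberately tiny buffer so the overflow loop runs more than once
        ByteBuffer out = ByteBuffer.allocate(4);
        int bytes = 0;
        CoderResult status;
        do {
            status = encoder.encode(src, out, true);
            if (status.isError())
                status.throwException();
            bytes += out.position();
            out.clear();
        } while (status.isOverflow());
        status = encoder.flush(out);
        if (status.isError() || status.isOverflow())
            status.throwException();
        return bytes + out.position();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String s = "héllo \uD83D\uDE00"; // mixes 1-, 2-, and 4-byte sequences
        int counted = getEncodedLength(CharBuffer.wrap(s), StandardCharsets.UTF_8.newEncoder());
        int actual = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(counted + " == " + actual); // both report 11
    }
}
```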
Trevor Robinson

You can loop through the String:

/**
 * Deprecated: doesn't support surrogate characters.
 */
@Deprecated
public int countUTF8Length(String str)
{
    int count = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        char c = str.charAt(i);
        if (c < 0x80)
        {
            count++;
        } else if (c < 0x800)
        {
            count += 2;
        } else
        {
            throw new UnsupportedOperationException("not implemented yet");
        }
    }
    return count;
}
Martijn Courteaux
    Close, but not quite: you're not handling surrogate characters properly. In particular, `c < 0x10000` (which is what you meant when you wrote `c < 0x1000`) is guaranteed to be true, because code-points outside the BMP are expressed as *two* Java characters (using surrogate code-points). – ruakh Dec 14 '11 at 21:07
  • @ruakh: Ah, yes I see. Indeed. That is correct. I thought to make a quick solution, but the surrogate characters are indeed a problem to be completely correct... But if it is outside the OP his needs, this will satisfy. – Martijn Courteaux Dec 14 '11 at 21:10
  • This would return 8 for a surrogate pair, which wouldn't be correct. For such subtle reasons, I'm trying to avoid writing this code myself. – Trevor Robinson Dec 14 '11 at 21:11
  • Is there even a guarantee on back and forth between 8-16 on combining characters vs precomposed characters? (That the encoder implementation observes?) I would be dubious of trusting something not generated by the encoder that will crank out the final output. – Affe Dec 14 '11 at 21:29
  • @Affe: I don't know of such a guarantee, but I find it *really* hard to believe that anyone would write an encoder that implicitly (and silently) modified the character sequence. I mean, that would mean that encoding and then decoding the sequence would result in a new string that isn't `equals`-equivalent to the old one! – ruakh Dec 14 '11 at 21:40