
I have the following code inside a loop.
In the loop, strings are appended to `sb` (a `StringBuilder`), and I check whether the size of `sb` has reached 5 MB.

if (sb.toString().getBytes("UTF-8").length >= 5242880) {
    // Do something
}

This works fine, but the size check itself is very slow.
What would be the fastest way to do this?

Holger
d-_-b

3 Answers


You can calculate the UTF-8 length quickly using

public static int utf8Length(CharSequence cs) {
    return cs.codePoints()
        .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
        .sum();
}

If ASCII characters dominate the contents, it might be slightly faster to use

public static int utf8Length(CharSequence cs) {
    return cs.length()
         + cs.codePoints().filter(cp -> cp > 0x7f).map(cp -> cp <= 0x7ff ? 1 : 2).sum();
}

instead.
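A quick way to convince yourself the per-codepoint arithmetic is right is to compare it against the conventional `getBytes` conversion. Here is a self-contained sketch (the class name and sample strings are arbitrary, chosen to cover 1-, 2-, 3-, and 4-byte codepoints):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthCheck {
    public static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
            .sum();
    }

    public static void main(String[] args) {
        // ASCII, a 2-byte character, 3-byte characters, and a 4-byte emoji (surrogate pair)
        String[] samples = { "hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "a\uD83D\uDE00b" };
        for (String s : samples) {
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            if (utf8Length(s) != expected) {
                throw new AssertionError(s + ": " + utf8Length(s) + " != " + expected);
            }
        }
        System.out.println("all lengths match");
    }
}
```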

But you may also consider the optimization potential of not recalculating the entire size, but only adding the size of the new fragment you’re appending to the StringBuilder, something like this:

    StringBuilder sb = new StringBuilder();
    int length = 0;
    for(…; …; …) {
        String s = … //calculateNextString();
        sb.append(s);
        length += utf8Length(s);
        if(length >= 5242880) {
            // Do something

            // in case you're flushing the data:
            sb.setLength(0);
            length = 0;
        }
    }

This assumes that if you’re appending fragments containing surrogate pairs, they are always complete and not split into their halves. For ordinary applications, this should always be the case.

An additional possibility, suggested by Didier L, is to postpone the calculation until your StringBuilder reaches a length of the threshold divided by three, since before that point it is impossible for the UTF-8 length to exceed the threshold. However, this will only be beneficial if some executions never reach threshold / 3.
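That postponed check can be sketched as follows. This is a self-contained illustration, not Didier L’s exact code: the small threshold, the `process` helper, and the fragment list are made up for demonstration, and `utf8Length` is the method from above.

```java
import java.util.ArrayList;
import java.util.List;

public class PostponedCheck {
    // Small threshold for demonstration; the question uses 5242880 (5 MB).
    static final int THRESHOLD = 64;

    static int utf8Length(CharSequence cs) {
        return cs.codePoints()
            .map(cp -> cp <= 0x7ff ? (cp <= 0x7f ? 1 : 2) : (cp <= 0xffff ? 3 : 4))
            .sum();
    }

    // Appends fragments, flushing whenever the UTF-8 size reaches THRESHOLD.
    static List<String> process(List<String> fragments) {
        List<String> flushed = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (String s : fragments) {
            sb.append(s);
            // A char encodes to at most 3 UTF-8 bytes, so while the char count
            // is below THRESHOLD / 3 the encoded size cannot reach THRESHOLD
            // and the O(n) scan can be skipped entirely.
            if (sb.length() >= THRESHOLD / 3 && utf8Length(sb) >= THRESHOLD) {
                flushed.add(sb.toString());
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            flushed.add(sb.toString());
        }
        return flushed;
    }

    public static void main(String[] args) {
        // 8 fragments of 10 ASCII chars: flushes once at 70 chars, leaves 10 behind.
        List<String> out = process(java.util.Collections.nCopies(8, "aaaaaaaaaa"));
        System.out.println(out.get(0).length() + ", " + out.get(1).length()); // prints "70, 10"
    }
}
```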

Holger
    As a further optimization, seeing that a character takes at most 3 bytes, you could also avoid computing the length until the `StringBuilder` length reaches 5MB/3. – Didier L Apr 24 '17 at 13:16
  • @Holger In jdk-9 there will be a `String::codePoints` implementation that will make a difference between ASCII and non-ASCII Strings... Also, this technique works only for UTF-8, but it's still nice. – Eugene Apr 25 '17 at 07:09
  • 1
    @Eugene: calculating the `UTF-8` length is the sole purpose of this exercise. Besides that, Java 9’s implementation of `codePoints()` will not make a difference to this answer. The difference between the two solutions of this answer is that the second executes only one conditional for ASCII characters and skips the addition operation. After fixing a mistake, the two variants don’t differ in the worst case anymore, so the 2nd always wins. A cheap “isAllASCII” method would be helpful, but as far as I know, Java 9 is only going to differentiate between iso-latin-1 and other strings internally. – Holger Apr 25 '17 at 12:50
  • 1
    @DidierL: "*a character takes at most 3 bytes*" - A single `char`, yes, but Java strings represent Unicode codepoints in UTF-16, so there may be 1 or 2 `char`s per codepoint in the string. In standard UTF-8, codepoints can be encoded up to 4 bytes, where a codepoint that is encoded up to 3 bytes only requires 1 Java `char`, but a codepoint encoded as 4 bytes requires 2 Java `char`s acting together. – Remy Lebeau Apr 25 '17 at 21:15
  • 1
    @RemyLebeau that does not change my reasoning since the `length()` of a `String`/`StringBuilder` is the number of `char`s, so if a codepoint takes 2 `char`s it would count as _at most 6 bytes_ which is still overestimated and thus compatible with this optimization. – Didier L Apr 26 '17 at 12:01
  • @Remy Lebeau: in addition to Didier L’s explanation, you may look at my second variant of `utf8Length`, which already takes the relationship between surrogate characters and the `UTF-8` representation into account. The result is the String length, i.e. `char`s per String, plus up to `2` per codepoint, hence, it is impossible to get more than `3` per `char`. For characters outside the BMP, it will count two `char`s plus `2` for the codepoint, resulting in `4`, which is the number of `UTF-8` bytes for the codepoint, but effectively only `2` bytes for each `char`. – Holger Apr 27 '17 at 07:28

If you loop 1,000 times, you will generate 1,000 `String` objects and convert each one into a UTF-8 byte array just to get its length.

I would reduce the conversions by storing the initial length. Then, on each iteration, convert only the appended value; updating the total is just an addition.

// StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of getBytes("UTF-8")
int length = sb.toString().getBytes(StandardCharsets.UTF_8).length;
for (String s : list) {
    sb.append(s);
    length += s.getBytes(StandardCharsets.UTF_8).length;
    if (...) {
        ...
    }
}

This reduces both the memory used and the conversion cost.
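A self-contained sketch of this running-total idea (the class name, helper method, and sample fragments are made up for illustration; it assumes appended fragments never split a surrogate pair, since a lone surrogate encodes differently from a complete pair):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class RunningTotal {
    // Converts each fragment exactly once and keeps a running byte total.
    static int totalUtf8Bytes(List<String> fragments, StringBuilder sb) {
        int length = sb.toString().getBytes(StandardCharsets.UTF_8).length;
        for (String s : fragments) {
            sb.append(s);
            length += s.getBytes(StandardCharsets.UTF_8).length; // only the new fragment
        }
        return length;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        // "foo" = 3 bytes, "bär" = 4 bytes, "日本語" = 9 bytes
        int length = totalUtf8Bytes(List.of("foo", "b\u00e4r", "\u65e5\u672c\u8a9e"), sb);
        // Sanity check: the running total matches one full conversion.
        int full = sb.toString().getBytes(StandardCharsets.UTF_8).length;
        System.out.println(length + " == " + full); // prints "16 == 16"
    }
}
```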

AxelH

Consider using a `ByteArrayOutputStream` and an `OutputStreamWriter` instead of the `StringBuilder`. Use `ByteArrayOutputStream.size()` to test the size.
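A minimal sketch of that approach (the sample string is arbitrary). One caveat: `OutputStreamWriter` buffers internally, so `size()` only reflects bytes that have already been flushed through to the underlying stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class StreamSize {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);

        writer.write("\u65e5\u672c\u8a9e"); // three 3-byte characters
        writer.flush(); // push the writer's internal buffer into baos

        System.out.println(baos.size()); // prints 9 -- the encoded size so far
        if (baos.size() >= 5242880) {
            // Do something, e.g. baos.reset();
        }
    }
}
```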

Maurice Perry