3

Note: This question is about Java >= 9 which introduced "compact strings"


Let's say I am appending an unknown number of strings (or chars) to a StringBuilder and at some point determine that I am appending the last string.

How can this be done efficiently?

Background

If the capacity of the string builder is not large enough it will always increase it to max(oldCap + str.lenght(), oldCap * 2 + 2). So if you are unlucky and the capacity is not enough for the last string, it will unnecessarily double the capcity, e.g.:

StringBuilder sb = new StringBuilder(4000);
sb.append("aaa..."); // 4000 * "a"
// Last string:
sb.append("b"); // Unnecessarily increases capacity from 4000 to 8002
return sb.toString();

StringBuilder offers the methods capacity(), length() and getChars(...), however manually creating a char[] and then creating a string will be inefficient because:

  • Due to "compact strings" the string builder has to convert its bytes to chars
  • When calling one of the String constructors the chars have to be compacted to bytes again

Another option would be to check capacity() and if necessary create a new StringBuilder(sb.length() + str.length()), then append sb and str:

StringBuilder sb = new StringBuilder(4000);
sb.append("aaa..."); // 4000 * "a"

String str = "b";
if (sb.capacity() - sb.length() < str.length()) {
    return new StringBuilder(sb.length() + str.length())
        .append(sb)
        .append(str)
        .toString();
}
else {
    return sb.append(str).toString();
}

The only disadvantage is that if the existing string builder or the new string is non-Latin 1 (2 bytes per char), the newly created string builder has to be "inflated" from 1 byte per char (Latin 1) to 2 bytes per char.

Marcono1234
  • 5,856
  • 1
  • 25
  • 43
  • 1
    Use `return sb.toString() + str;` since the specialists advice is to "Use + instead of StringBuilder where possible" (see "Lessons from Today" [here](https://www.javaspecialists.eu/talks/pdfs/2018%20Voxxed%20in%20Thessaloniki,%20Greece%20-%20%22Enough%20java.lang.String%20to%20Hang%20Ourselves%20...%22%20by%20Heinz%20Kabutz.pdf) (found via this [article](https://dzone.com/articles/jdk-9jep-280-string-concatenations-will-never-be-t))). I don't know these specialists though, so you might have to look into their credibility. – vanOekel Nov 02 '19 at 20:27
  • Good point, Aleksey Shipilëv also described it in [this presentation](https://youtu.be/wIyeOaitmWM?t=3186). In my case I am actually dealing with `char[]`, however converting them to strings and then using string concatenation might still be more efficient. – Marcono1234 Nov 02 '19 at 23:31
  • 1
    @vanOekel it's 1/2 correct what they say, it is not "directly" compiled to `StringBuilder`, but 4 out of 5 strategies [still](https://stackoverflow.com/a/46516179/1059372) use it under the hood. – Eugene Nov 03 '19 at 18:11
  • 1
    @vanOekel and of course "where possible", is simply wrong, when you are concatenating in a loop - nothing changes, a `StringBuilder` directly is [more efficient](https://stackoverflow.com/questions/49458564/string-concatenation-in-a-for-loop-java-9) – Eugene Nov 03 '19 at 19:47
  • 1
    @vanOekel but then, rather `return sb + str;`, without forcing the creation of another intermediate `String` instance. – Holger Nov 06 '19 at 08:06

1 Answers1

1

You are describing separate different problems IMO, but neither of them is an "actual" problem.

First, is the fact that StringBuilder allocates too much space - that is rarely (if ever) a problem in practice. Think about any List/Set/Map - they do the same thing, might allocate too much, but when you remove an element, they don't shrink their internal storage. They do have a method for that; but so does StringBuilder:

 trimToSize

Due to "compact strings" the string builder has to convert its bytes to chars.

StringBuilder knows what it is storing via the coder field in AbstractStringBuilder which it extends. With compact Strings, String holds its data in a byte[] now (it has a coder too), thus I don't understand where that conversion from byte[] to char[] is supposed to happen. StringBuilder::toString is defined as:

public String toString() {
    // Create a copy, don't share the array
    return isLatin1() ? StringLatin1.newString(value, 0, count)
                      : StringUTF16.newString(value, 0, count);
}

Notice the isLatin1 check - StringBuilder knows what type of data it has internally; thus no conversion when possible.

I assume that by this:

When calling one of the String constructors the chars have to be compacted to bytes again

you mean:

char [] some = ...
String s = new String(some);

I don't know why you are using again here, but may be I am missing something. Just notice that this conversion from char[] to byte[] indeed has to happen, but it's fairly trivial to do (the last 8 bits have to be empty), and as soon as a single char does not meet the precondition, the entire conversion is bailed out. So you either store all characters in LATIN1, or you don't.

Eugene
  • 117,005
  • 15
  • 201
  • 306
  • The two quotes refer to the preceding sentence: "however manually creating a char[] and then creating a string will be inefficient because" – Marcono1234 Nov 05 '19 at 15:36
  • @Marcono1234 can you may be provide an example of what you have in mind? I seems to not get it, might be my mistake? – Eugene Nov 05 '19 at 15:38
  • The "Background" section describes why it is inefficient and what alternatives exist and why I am not happy with them either. One of them would be to create the string manually, e.g. consider: `StringBuilder sb` then when adding the last string (`str`) do the following: `char[] buff = new char[sb.length() + str.length()]; sb.getChars(0, sb.length(), buff, 0); str.getChars(0, str.length(), buff, sb.length()); return new String(buff)` – Marcono1234 Nov 05 '19 at 15:58
  • @Marcono1234 this focus on the last element looks irrational to me. What if the unnecessarily large capacity increase operation happens at the forelast element? Or third from last? – Holger Nov 05 '19 at 17:04
  • @Holger, then there is nothing you can do about it because you don't know how much elements will follow. However for the last element you know it is the last and should be able to optimize it. – Marcono1234 Nov 05 '19 at 17:39
  • @Marcono1234 so you're wasting time optimizing a rare corner case that may happen once every 1000 runs. Even worse, your optimization may actually defeat the JVM's optimizations. – Holger Nov 05 '19 at 17:44
  • @Holger, but if the requirement for this corner case is `if (sb.capacity() - sb.length() < str.length())` (all simple getters) and it succeeds 999 times, then the JVM should be able to optimize this as well, shouldn't it? I am also not asking for cases where the difference would be a 4K vs 8K array, but rather for example users might be reading Base64 binary data (as string instead of using a `Reader`) which can easily get pretty large. – Marcono1234 Nov 05 '19 at 18:29
  • @Marcono1234 there are JVM optimizations for string concatenation, which might fix the capacity or even eliminate the otherwise unavoidable copying operation between the builder and the resulting string. But I can’t predict whether they will apply to your case. What can be predicted, is that attempts to optimize this manually change the code pattern from “very common” to “rather uncommon”, which will make it less likely to be optimized. I’ve see that before, optimizations looking good in source code actually degrading performance. But when dealing with base64, just fix the initial capacity… – Holger Nov 06 '19 at 08:03