54

How does StringBuilder work?

What does it do internally? Does it use unsafe code? And why is it so fast (compared to the + operator)?

shA.t
  • 16,580
  • 5
  • 54
  • 111
Alon Gubkin
  • 56,458
  • 54
  • 195
  • 288
  • 4
    If you're curious about the details, you could also just [download the reference source](http://referencesource.microsoft.com/) and look. There's a commented copy of `StringBuilder.cs` at `RefSrcDirectory\Source\.Net\4.0\DEVDIV_TFS\Dev10\Releases\RTMRel\ndp\clr\src\BCL\System\Text\StringBuilder.cs\1305376` . The directory structure I listed may not be exactly the same as that found in the reference source, and is also specific to .Net 4.0. – Brian Jun 29 '11 at 19:16
  • Related: [how-the-stringbuilder-class-is-implemented-does-it-internally-create-new-string](http://stackoverflow.com/questions/3564906/how-the-stringbuilder-class-is-implemented-does-it-internally-create-new-string) – nawfal Jun 10 '14 at 08:17

4 Answers4

84

When you use the + operator to build up a string:

string s = "01";
s += "02";
s += "03";
s += "04";

then on the first concatenation we make a new string of length four and copy "01" and "02" into it -- four characters are copied. On the second concatenation we make a new string of length six and copy "0102" and "03" into it -- six characters are copied. On the third concat, we make a string of length eight and copy "010203" and "04" into it -- eight characters are copied. So far a total of 4 + 6 + 8 = 18 characters have been copied for this eight-character string. Keep going.

...
s += "99";

On the 98th concat we make a string of length 198 and copy "010203...98" and "99" into it. That gives us a total of 4 + 6 + 8 + ... + 198 = a lot, in order to make this 198 character string.

A string builder doesn't do all that copying. Rather, it maintains a mutable array that is hoped to be larger than the final string, and stuffs new things into the array as necessary.

What happens when the guess is wrong and the array gets full? There are two strategies. In the previous version of the framework, the string builder reallocated and copied the array when it got full, and doubled its size. In the new implementation, the string builder maintains a linked list of relatively small arrays, and appends a new array onto the end of the list when the old one gets full.

Also, as you have conjectured, the string builder can do tricks with "unsafe" code to improve its performance. For example, the code which writes the new data into the array can already have checked that the array write is going to be within bounds. By turning off the safety system it can avoid the per-write check that the jitter might otherwise insert to verify that every write to the array is safe. The string builder does a number of these sorts of tricks to do things like ensuring that buffers are reused rather than reallocated, ensuring that unnecessary safety checks are avoided, and so on. I recommend against these sorts of shenanigans unless you are really good at writing unsafe code correctly, and really do need to eke out every last bit of performance.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • 3
    might be worth adding a note to the effect that this is not what would happen if you did string s = x + y + z; (uses String.Concat) in case any silly people decide that they need to 'optimize' all those into StringBuilders (seriously I knew people who did that) – ShuggyCoUk Jun 30 '11 at 09:09
  • 2
    I didn't know that about the new version of StringBuilder - that's a good optimization. If I append a massive string to the string builder, does it use the 'relatively small' array size you mentioned for the newly-created arrays, or does it use a larger size to fit my entire new string? – configurator Jun 30 '11 at 19:57
  • Never mind, I've found the answer myself - a stringbuilder normally expands by as much as is needed, or by doubling its size - whichever is larger. – configurator Jun 30 '11 at 20:01
  • 2
    If you did `string s = "01" + "02" + "03" + "04"` would it compile into `string s = string.Concat("01","02","03","04")`? (Actually I think the compiler would just optimize it to `string s = "01020304"` but would it use String.Concat if all the pieces were not hard coded string values?) – Nick Feb 07 '12 at 16:53
  • @Nick: The compiler would fold the constants. If it could not fold the constants then it would call the appropriate overload of String.Concat. See http://stackoverflow.com/questions/9132338/how-many-string-objects-will-be-created-when-using-a-plus-sign/9135983#9135983 for details. – Eric Lippert Feb 07 '12 at 17:13
  • Would String.Concat create a string of length 8 and copy each piece into it to make only 1 copy of each? – Nick Feb 10 '12 at 16:05
  • 2
    @Nick: Yes. Each *call* to Concat gets the *total* length of all its arguments and allocates a new string exactly big enough. – Eric Lippert Feb 10 '12 at 16:16
24

StringBuilder's implementation has changed between versions, I believe. Fundamentally though, it maintains a mutable structure of some form. I believe it used to use a string which was still being mutated (using internal methods) and would just make sure it would never be mutated after it was returned.

The reason StringBuilder is faster than using string concatenation in a loop is precisely because of the mutability - it doesn't require a new string to be constructed after each mutation, which would mean copying all the data within the string etc.

For just a single concatenation, it's actually slightly more efficient to use + than to use StringBuilder. It's only when you're performing multiple operations and you don't really need the intermediate results that StringBuilder shines.

See my article on StringBuilder for more information.

snipsnipsnip
  • 2,268
  • 2
  • 33
  • 34
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • found your comment from about 6 years ago on this very subject: http://bytes.com/topic/c-sharp/answers/230649-stringbuilder-how-does-internally-work –  Jun 29 '11 at 16:51
3

The Microsoft CLR does do some operations with internal call (not quite the same as unsafe code). The biggest performance benefit over a bunch of + concatenated strings is that it writes to a char[] and doesn't create as many intermediate strings. When you call ToString (), it builds a completed, immutable string from your contents.

agent-j
  • 27,335
  • 5
  • 52
  • 79
  • Would you mind providing more details? Does it redefine the array size when you group items together and create a huge string? Is it just an array of pointers to char arrays (or a linked list) that gets turned into a single object when tostring is called? Would you mind citing a source? – JSWork Jun 29 '11 at 17:24
  • The internals are transparent, but since it has methods like StringBuilder.EnsureCapacity, it makes one believe it's a single large buffer that grows if necessary. – agent-j Jun 29 '11 at 17:27
  • Wouldn't that be less efficient than using a linked list and merging it at the end? I mean what if you have a one meg string that you are appending? You'd have to create a copy of something that already exists that takes time and resources. If you just made a pointer to the original you wouldn't have to worry abut it changing because it's immutable and the gac wouldn't touch it to delete it because of your reference to the immutable string. – JSWork Jun 29 '11 at 20:46
  • 1
    @JSWork, what if I say `stringBuilder.Remove(1023, 2000)`. That's a complicated algorithm if you have a linked list of strings. I'm sure that wouldn't be as efficient. However, feel free to implement your own LLStringBuilder class this way if you know you will not need features like Insert, Remove, Replace, etc.. – agent-j Jun 29 '11 at 20:51
  • @agent-j: I guess it's a tradeoff. You'll have to parse through the linked list and put it into a string eventually - my main objective is to reallocate the array lenght as few times as possible. Using the linked list for append lets you do that. You can have a separate private method that creates a char array when needed (removeat, tostring, etc) based off the total size of the linked list (which replaces the linked list)- you'll need it for tostring anyhow. What I'm looking for is redimming hte array as little as possible. GC reallocation due to the array not fitting could really hamper – JSWork Jun 29 '11 at 21:37
  • 1
    @JSWork, with `new StringBuilder(2048*1024)` you can specify a large-enough initial capacity and that will minimize the redimensioning cost. (You probably know that already, but it might benefit a future reader.) – agent-j Jun 29 '11 at 21:44
2

The StringBuilder uses a string buffer that can be altered, compared to a regular String that can't be. When you call the ToString method of the StringBuilder it will just freeze the string buffer and convert it into a regular string, so it doesn't have to copy all the data one extra time.

As the StringBuilder can alter the string buffer, it doesn't have to create a new string value for each and every change to the string data. When you use the + operator, the compiler turns that into a String.Concat call that creates a new string object. This seemingly innocent piece of code:

str += ",";

compiles into this:

str = String.Concat(str, ",");
Guffa
  • 687,336
  • 108
  • 737
  • 1,005