3

So a professor in university just told me that using concatenation on strings in C# (i.e. when you use the plus sign operator) creates memory fragmentation, and that I should use string.Format instead.

Now, I've searched a lot on Stack Overflow and found a lot of threads about performance, where concatenating strings wins hands down. (Some of them include this, this and this)

I can't find anyone who talks about memory fragmentation though. I opened .NET's string.Format using ILSpy and apparently it uses the same string builder as the string.Concat method does (which, if I understand correctly, is what the + sign is overloaded to). In fact, it uses the code in string.Concat!

I found this article from 2007 but I doubt it's accurate today (or ever!). Apparently the compiler is smart enough to avoid that today, because I can't seem to reproduce the issue. Adding strings with string.Format and adding them with plus signs both end up using the same code internally. As said before, string.Format uses the same code string.Concat uses.
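A minimal sketch of the comparison described above, assuming nothing beyond the standard `string.Concat` and `string.Format` methods; the class and method names are hypothetical. It only checks that the three approaches yield equal strings and leave their operands untouched. It says nothing about what the compiler or runtime does internally.

```csharp
using System;

public static class ConcatVersusFormat
{
    // Hypothetical helpers, one per approach mentioned in the question.
    public static string WithPlus(string a, string b, string c) => a + b + c;

    public static string WithConcat(string a, string b, string c) => string.Concat(a, b, c);

    public static string WithFormat(string a, string b, string c) => string.Format("{0}{1}{2}", a, b, c);

    public static void Main()
    {
        string a = "foo", b = "bar", c = "baz";

        // All three build one new result string from the same inputs.
        Console.WriteLine(WithPlus(a, b, c) == WithConcat(a, b, c));
        Console.WriteLine(WithPlus(a, b, c) == WithFormat(a, b, c));

        // Strings are immutable, so the operands are unchanged afterwards.
        Console.WriteLine(a == "foo" && b == "bar" && c == "baz");
    }
}
```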

So now I'm starting to doubt his claim. Is it true?

Gaspa79
  • 3
    Can't say I've ever heard of this. I think it would at least be reasonable to ask for some evidence. Even if this was true a long time ago, it may not be now. – Jon Skeet May 10 '16 at 18:55
  • 2
    I doubt there's any merit to that. Fragmentation comes from allocating and freeing something that both concatenation and formatting do. I would be curious to see his evidence. – Brian Rasmussen May 10 '16 at 18:56
  • 1
    Maybe he's talking about the fact that strings in c# are immutable? – Nasreddine May 10 '16 at 18:58
  • 1
    Even if it were true, it sounds like pre-optimization to me. I think the syntactical niceties of the overloaded + operator will be of greater benefit in the long run. I would only worry about such optimizations after it has been determined that some optimization regarding fragmentation is actually needed in your use case. – bodangly May 10 '16 at 19:19
  • 1
    Now you have 2 comments: Jon Skeet, author of C# in depth and Brian Rasmussen, a Program Manager at Microsoft. – Thomas Weller May 10 '16 at 19:19
  • True: Using `s1 + s2` creates a third string; it does not change `s1` or `s2`. True: Doing this *excessively*, like in a loop to concatenate a large number of smaller strings into one bigger string, can benefit from using `StringBuilder` to avoid putting more pressure on the garbage collector than necessary. True: Using `s1 + s2 + s3 + s4` (that is, up to 4 strings) ends up being compiled as a single call to `String.Concat`, which is quite optimized. It will in this case create just 1 new string, not one from s1 + s2, and so on. This has nothing to do with *memory fragmentation* however. – Lasse V. Karlsen May 10 '16 at 19:23
  • 1
    I posted an [example of LOH fragmentation](http://stackoverflow.com/a/30361185/480982). You may modify it to use large strings instead of byte arrays. Try `+` and `Format()` to see if there's a difference. – Thomas Weller May 10 '16 at 19:26
  • 1
    The article you linked to is not *wrong* per se. There are a few points in it that are out of date -- not every version of the builder uses double-when-full, for example. But the more general problem is that it gives performance advice as "tips and tricks" rather than applying an empirical research discipline to a realistic problem. – Eric Lippert May 10 '16 at 20:05
  • 1
    @EricLippert thanks a lot for the answer and this comment! I think I now get what you say by "don't try to achieve it using tips". =) – Gaspa79 May 10 '16 at 20:23

2 Answers

22

> So a professor in university just told me that using concatenation on strings in C# (i.e. when you use the plus sign operator) creates memory fragmentation, and that I should use string.Format instead.

No, what you should do instead is do user research, set user-focussed real-world performance metrics, and measure the performance of your program against those metrics. When, and only when you find a performance problem, you should use the appropriate profiling tools to determine the cause of the performance issue. If the cause is "memory fragmentation" then address that by identifying the causes of the "fragmentation" and trying experiments to determine what techniques mitigate the effect.

Performance is not achieved by "tips and tricks" like "avoid string concatenation". Performance is achieved by applying engineering discipline to realistic problems.

To address your more specific problem: I have never heard the advice to eschew concatenation in favor of formatting for performance reasons. The advice usually given is to eschew iterated concatenation in favor of builders. Iterated concatenation is quadratic in time and space and creates collection pressure. Builders allocate unnecessary memory but are linear in typical scenarios. Neither creates fragmentation of the managed heap; iterated concatenation tends to produce contiguous blocks of garbage.
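To make the quadratic-versus-linear point concrete, here is a rough cost model rather than a measurement. It assumes that growing a string one character at a time with `s = s + "a"` copies the whole new string at each step, and it ignores runtime details; `CopyCost` and its methods are hypothetical names.

```csharp
using System;
using System.Text;

public static class CopyCost
{
    // Model: step i of iterated concatenation builds a new string of length i,
    // so the total characters copied is 1 + 2 + ... + n = n(n + 1) / 2.
    public static long IteratedConcatCopies(int n)
    {
        long total = 0;
        for (int i = 1; i <= n; i++)
        {
            total += i;
        }
        return total;
    }

    // The builder appends in place and materializes the string once at the end.
    public static string BuildWithBuilder(int n)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < n; i++)
        {
            sb.Append('a');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(IteratedConcatCopies(100000));    // 5000050000
        Console.WriteLine(BuildWithBuilder(100000).Length); // 100000
    }
}
```

Under this model, 100,000 single-character concatenations copy about five billion characters in total, while the final string is only 100,000 characters long.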

The number of times I've had a performance problem that came down to unnecessary fragmentation of a managed heap is exactly one; in an early version of Roslyn we had a pattern where we would allocate a small long lived object, then a small short lived object, then a small long lived object... several hundred thousand times in a row, and the resulting maximally fragmented heap caused user-impacting performance problems on collections; we determined this by careful measurement of the performance in the relevant scenarios, not by ad hoc analysis of the code from our comfortable chairs.

The usual advice is not to avoid fragmentation, but rather to avoid pressure. We found during the design of Roslyn that pressure was far more impactful on GC performance than fragmentation, once our aforementioned allocation pattern problem was fixed.

My advice to you is to either press your professor for an explanation, or to find a professor who has a more disciplined approach to performance metrics.

Now, all that said, you should use formatting instead of concatenation, but not for performance reasons. Rather, for code readability, localizability, and similar stylistic concerns. A format string can be made into a resource, it can be localized, and so on.
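As a sketch of the localizability point, assume the format strings live in some resource store; the dictionary below is a hypothetical stand-in for a real resource file, and the language keys and templates are invented for illustration. Because the placeholders are numbered, a translation can reorder the arguments, which concatenation in code cannot accommodate.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LocalizedGreeting
{
    // Hypothetical stand-in for a resource file: one format string per language.
    private static readonly Dictionary<string, string> Templates = new Dictionary<string, string>
    {
        ["en"] = "{0} has {1} new messages.",
        ["es"] = "Hay {1} mensajes nuevos para {0}."
    };

    public static string Render(string language, string user, int count)
    {
        // A real application would pass the user's culture here; the invariant
        // culture keeps this sketch independent of machine settings.
        return string.Format(CultureInfo.InvariantCulture, Templates[language], user, count);
    }

    public static void Main()
    {
        Console.WriteLine(Render("en", "Ana", 3));
        Console.WriteLine(Render("es", "Ana", 3));
    }
}
```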

Finally, I caution you that if you are putting strings together in order to build something like a SQL query or a block of HTML to be served to a user, then you want to use none of these techniques. These applications of string building have serious security impacts when you get them wrong. Use libraries and tools specifically designed for construction of those objects, rather than rolling your own with strings.
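To illustrate the difference, the sketch below uses `XElement` from `System.Xml.Linq` as a stand-in for a purpose-built markup library; it is an XML builder, not an HTML templating engine, and the class and method names are hypothetical. The same reasoning applies to SQL, where the usual tool is a parameterized query rather than a concatenated command string.

```csharp
using System;
using System.Xml.Linq;

public static class SafeMarkup
{
    // Unsafe: the input is pasted into the markup verbatim.
    public static string ByConcatenation(string userInput) => "<p>" + userInput + "</p>";

    // Safer: the builder treats the input as text content and escapes it.
    public static string ByBuilder(string userInput) => new XElement("p", userInput).ToString();

    public static void Main()
    {
        string input = "<script>alert(1)</script>";

        Console.WriteLine(ByConcatenation(input)); // the script tag survives intact
        Console.WriteLine(ByBuilder(input));       // the angle brackets are escaped
    }
}
```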

Eric Lippert
0

The problem with string concatenation is that strings are immutable. string1 + string2 does not concatenate string2 onto string1, it creates a whole new string. Using a StringBuilder (or string.Format) does not have this problem. Internally, the StringBuilder holds a char[], which it over-allocates. Appending something to a StringBuilder does not create any new objects unless it runs out of room in the char[] (in which case it over-allocates a new one).
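The over-allocation can be observed indirectly through the public `Capacity` property. This is only a sketch: the internal layout of `StringBuilder` differs between runtime versions, so the exact number of growth steps is not something to rely on, and `BuilderCapacity` is a hypothetical name.

```csharp
using System;
using System.Text;

public static class BuilderCapacity
{
    // Counts how many times Capacity changes while appending n characters.
    // Each change corresponds to the builder acquiring more storage.
    public static int CountCapacityChanges(int n)
    {
        var sb = new StringBuilder();
        int changes = 0;
        int last = sb.Capacity;

        for (int i = 0; i < n; i++)
        {
            sb.Append('a');
            if (sb.Capacity != last)
            {
                changes++;
                last = sb.Capacity;
            }
        }
        return changes;
    }

    public static void Main()
    {
        // Far fewer growth steps than appends, because each step reserves spare room.
        Console.WriteLine(CountCapacityChanges(100000));
    }
}
```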

I ran a quick benchmark. I think it proves the point :)

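        // Requires: using System.Diagnostics; using System.Text;
        StringBuilder sb = new StringBuilder();
        string st;
        Stopwatch sw;

        // Case 1: append to a StringBuilder, then create the string once.
        sw = Stopwatch.StartNew();

        for (int i = 0; i < 100000; i++)
        {
            sb.Append("a");
        }

        st = sb.ToString();

        sw.Stop();
        Debug.WriteLine($"Elapsed: {sw.Elapsed}");

        // Case 2: iterated concatenation; each pass allocates a new, longer string.
        st = "";

        sw = Stopwatch.StartNew();

        for (int i = 0; i < 100000; i++)
        {
            st = st + "a";
        }

        sw.Stop();
        Debug.WriteLine($"Elapsed: {sw.Elapsed}");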

The output:

Elapsed: 00:00:00.0011883 (StringBuilder.Append())

Elapsed: 00:00:01.7791839 (+ operator)

glenebob
  • 2
    But it does not create memory *fragmentation*. – Lasse V. Karlsen May 10 '16 at 19:20
  • I'm assuming the term he used is simply not perfectly accurate. However, it can definitely drive the GC nuts. Using string.Format() or a StringBuilder is absolutely the correct advice. – glenebob May 10 '16 at 19:23
  • Hi! Thanks for your answer! Now, I don't understand your reasoning. string.Format will also create a third string: the result string. string.Format does use a string builder, yes, but string.Concat doesn't do anything different. It allocates *a single string* no matter how many strings you add and it fills it with the data, so I don't see the memory fragmentation problem there =(. – Gaspa79 May 10 '16 at 19:24
  • 1
    string1 + string2 results in a new string object created, of size string1.Length + string2.Length; using a StringBuilder results in at least three new objects - the StringBuilder, the char[], and the final string when you extract it. How is this better? Performing a lot of concatenations would cause some fragmentation, but depending on the string size, the char[] will still have to be re-allocated repeatedly; StringBuilder is faster, String.Format is better for special-purpose formatting, but citing fragmentation alone as a reason doesn't seem to make sense. – Matt Jordan May 10 '16 at 19:25
  • @glenebob Why is it the correct advice? Performance-wise Concat is faster, and memory-wise they seem to be equal. I agree that it's much better in terms of styling when you have 2+ strings, but if you want to do something like myStr += "a" it is silly to use string.Format. – Gaspa79 May 10 '16 at 19:25
  • Think of the case where you need to append lots of strings onto an existing one, like string1 = string1 + string2; string1 = string1 + string3;, and so on, as in a loop. Allocations quickly become a problem. – glenebob May 10 '16 at 19:28
  • @glenebob I don't agree, sorry. If you're going to do that you can put it all in one line like `string1 += string2 + string3` and the memory result will still be the same as with string.Format (with a negligible performance gain). If you're inside a loop you can use string.Concat to make sure the compiler will translate it. I will upvote you for the intentions anyway, don't worry =). – Gaspa79 May 10 '16 at 19:32
  • 1
    Your benchmark has numerous flaws. Issue one: suppose the first loop causes collection pressure, but not enough to cause a collection. Suppose the second loop causes a collection. The collection cost associated with the first loop has now been charged to the second loop. Issue two: suppose the second loop causes collection pressure, but the program ends before enough pressure has built up to cause a collection. The performance impact of the pressure of the second loop is charged to no one. I could go on in this vein for some time. – Eric Lippert May 10 '16 at 20:10
  • That is of course not to deny that plainly the first loop is linear and the second is quadratic; that is clear. My point though is that the original claim made is (1) that the performance impact of *fragmentation* is germane, and (2) that this cost is relevant in the *non-iterated* concatenation scenario, not the *iterated* concatenation scenario. That iterated concatenation is quadratic is well understood. Your benchmark doesn't address the issue actually under discussion. – Eric Lippert May 10 '16 at 20:12
  • Again, I'm treating the "fragmentation" claim as a typo or something. If the problem with iterated concatenation is so "well understood", why are we having this discussion? Clearly the OP and most commenters in this post do not seem to understand it very well. – glenebob May 10 '16 at 20:16
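The attribution problems raised in the comments above can be partly addressed by isolating each measurement. The following is a rough sketch under stated assumptions, not a rigorous harness: `IsolatedBenchmark` and `Measure` are hypothetical names, the iteration count is arbitrary, and a dedicated benchmarking library would be a better choice for real measurements. It also still measures iterated concatenation only, so it does not settle the fragmentation question.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class IsolatedBenchmark
{
    public static (TimeSpan Elapsed, int Gen0Collections) Measure(Action work)
    {
        // Start from a clean heap so earlier allocations are not charged to this run.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int before = GC.CollectionCount(0);
        var sw = Stopwatch.StartNew();

        work();

        // Collect inside the timed region so the garbage produced by this run
        // is charged to this run. This forced collection is included in the count.
        GC.Collect();
        sw.Stop();

        return (sw.Elapsed, GC.CollectionCount(0) - before);
    }

    public static string WithBuilder(int n)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.Append("a");
        return sb.ToString();
    }

    public static string WithConcat(int n)
    {
        string s = "";
        for (int i = 0; i < n; i++) s = s + "a";
        return s;
    }

    public static void Main()
    {
        const int n = 20000;
        var builder = Measure(() => WithBuilder(n));
        var concat = Measure(() => WithConcat(n));

        Console.WriteLine($"StringBuilder: {builder.Elapsed}, gen0 collections: {builder.Gen0Collections}");
        Console.WriteLine($"+ operator:    {concat.Elapsed}, gen0 collections: {concat.Gen0Collections}");
    }
}
```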