
I was playing with C# and wanted to speed up a program. I made changes and was able to do so. However, I need help understanding why the change made it faster.

I've attempted to reduce the code to something easier to understand in a question. Score1 and Report1 are the slower way; Score2 and Report2 are the faster way. The first method stores a string and an int in a struct, in parallel. Next, in a serial loop, it iterates over an array of those structs and writes their data to a buffer. The second method writes the data to a string buffer in parallel. Next, in a serial loop, it writes the string data to a buffer. Here are some sample run times:

Run 1 Total Average Time = 0.492087 sec
Run 2 Total Average Time = 0.273619 sec

When I was working with an earlier non-parallel version of this, the times were almost the same. Why the difference with the parallel version?

Even if I reduce the loop in Report1 to write a single line of output to the buffer, it is still slower (total time about 0.42 sec).

Here is the simplified code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.IO;

namespace OptimizationQuestion
{
    class Program
    {
        struct ValidWord
        { 
            public string word;
            public int score;
        }
        ValidWord[] valid;
        StringBuilder output;
        int total; 

        public void Score1(string[] words)
        {
            valid = new ValidWord[words.Length];

            for (int i = 0; i < words.Length; i++)
            {
                StringBuilder builder = new StringBuilder();

                foreach (char c in words[i])
                {
                    if (c != 'U')
                        builder.Append(c);
                }
                if (words[i].Length == 3)
                {
                    valid[i] = new ValidWord 
                    { word = builder.ToString(), score = words[i].Length };
                }
            }
        }
        public void Report1(StringBuilder outputBuffer)
        {
            int total = 0;
            foreach (ValidWord wordInfo in valid)
            {
                if (wordInfo.score > 0)
                {
                    outputBuffer.AppendLine(String.Format("{0} {1}", wordInfo.word.ToString(), wordInfo.score));
                    total += wordInfo.score;
                }
            }
            outputBuffer.AppendLine(string.Format("Total = {0}", total));
        }

        public void Score2(string[] words)
        {
            output = new StringBuilder();
            total = 0;           
            for (int i = 0; i < words.Length; i++)
            {
                StringBuilder builder = new StringBuilder();

                foreach (char c in words[i])
                {
                    if (c != 'U')
                        builder.Append(c);
                }
                if (words[i].Length == 3)
                {
                    output.AppendLine(String.Format("{0} {1}", builder.ToString(), words[i].Length));
                    total += words[i].Length;
                }
            }
        }
        public void Report2(StringBuilder outputBuffer)
        {
            outputBuffer.Append(output.ToString());
            outputBuffer.AppendLine(string.Format("Total = {0}", total));
        } 
        static void Main(string[] args)
        {
            Program[] program = new Program[100];
            for (int i = 0; i < program.Length; i++)
                program[i] = new Program(); 

            string[] words = File.ReadAllLines("words.txt");

            Stopwatch stopwatch = new Stopwatch();
            const int TIMING_REPETITIONS = 20;
            double averageTime1 = 0.0;
            StringBuilder output = new StringBuilder();
            for (int i = 0; i < TIMING_REPETITIONS; ++i)
            {
                stopwatch.Reset();
                stopwatch.Start();
                output.Clear();
                Parallel.ForEach<Program>(program, p =>
                    {
                        p.Score1(words);
                    });
                for (int k = 0; k < program.Length; k++)
                    program[k].Report1(output);
                stopwatch.Stop();
                averageTime1 += stopwatch.Elapsed.TotalSeconds;
                GC.Collect();
            }
            averageTime1 /= (double)TIMING_REPETITIONS;
            Console.WriteLine(string.Format("Run 1 Total Average Time = {0:0.000000} sec", averageTime1));
            double averageTime2 = 0.0;
            for (int i = 0; i < TIMING_REPETITIONS; ++i)
            {
                stopwatch.Reset();
                stopwatch.Start();
                output.Clear();
                Parallel.ForEach<Program>(program, p =>
                    {
                        p.Score2(words);
                    });
                for (int k = 0; k < program.Length; k++)
                    program[k].Report2(output);
                stopwatch.Stop();
                averageTime2 += stopwatch.Elapsed.TotalSeconds;
                GC.Collect();
            }
            averageTime2 /= (double)TIMING_REPETITIONS;
            Console.WriteLine(string.Format("Run 2 Total Average Time = {0:0.000000} sec", averageTime2));
            Console.ReadLine();
        }
    }
}
jlim
    Why are you trying to rank such different code as Report1 and Report2? Report1 contains a loop and Report2 doesn't. Maybe in the non-parallel version the C# compiler unrolled the loop or some other magic? – Earlz Feb 09 '11 at 05:32
  • Reducing the Report1 loop to one iteration helps a little (0.42 sec), but after posting, I think it is the array allocation in Score1. – jlim Feb 09 '11 at 05:35
  • Note: the word list is about 14,000 lines of strings. So each call of Score1 allocates 14,000 structs. – jlim Feb 09 '11 at 06:02
    I think you should try to do some profiling. It's difficult to tell exactly why it's slower without proper measurement. It's true that allocations are costly, but from your previous comment, I would think that a new[] of struct would translate to: malloc(sizeof(struct) * size); which would be kind of fast. Structs are not stored as separate objects in arrays, but are grouped together. – Sauleil Feb 16 '11 at 20:17
  • Won't the score always be "3" for all these words? – Joel Coehoorn Feb 28 '11 at 03:54

7 Answers


The size of a struct should typically be less than that of a pointer if performance is the primary issue (Microsoft says that anything less than 16 bytes performs better as a struct if reference-type semantics aren't needed); otherwise the overhead of passing it around increases (because it is passed by value) and becomes more than it would have been for just passing a pointer. Your struct contains a pointer and an int (making it bigger than a pointer), so you would be experiencing overhead because of this.

See the When to use structs section of this article.
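To make the size claim concrete, here is a minimal sketch (my illustration, not from the article; Unsafe.SizeOf comes from the System.Runtime.CompilerServices.Unsafe NuGet package, which is newer than this question):

using System;
using System.Runtime.CompilerServices;

struct ValidWord
{
    public string word; // a reference: 4 bytes on x86, 8 bytes on x64
    public int score;   // 4 bytes
}

class StructSizeDemo
{
    static void Main()
    {
        // On x64 this typically prints 16: an 8-byte reference plus a
        // 4-byte int, padded to the reference's alignment. Passing the
        // struct by value copies all of those bytes each time.
        Console.WriteLine(Unsafe.SizeOf<ValidWord>());
    }
}

Whether that copy cost matters here is a separate question; the point is only that the struct is larger than a single reference.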

Cornelius
  • *The size of a struct should typically be less than that of a pointer (Microsoft recommends not more than 16 bytes)* - A reference in C# is nowhere near 16 bytes. Hard to make a useful struct type that is less than the 32/64 (+ some housekeeping overhead) bits that a reference requires. – Ed S. Feb 09 '11 at 07:40
  • @Ed S - that is why it is only a recommendation and not an enforcement. Your structs could be larger; just bear in mind this does come with a performance hit (which you should weigh up and consider when deciding between a struct and a class, as Microsoft has with many of their structs). – Cornelius Feb 09 '11 at 07:44
  • Ok, but it's a bad recommendation. I mean really, please show me a useful struct that is <= the size of a reference. By your recommendation, every struct type in the framework itself fails, so I'm not sure how you can say it would be "typical". – Ed S. Feb 09 '11 at 07:45
  • There are no pointers in the struct, but a _reference_ to a string and an int. – Rune FS Feb 09 '11 at 07:50
  • @Ed S There are other things to consider as well that can mean it is better to use a struct, but this question does **not** deal with structs as such but with performance, and this one aspect of structs **does** impact performance. The recommendation is only about _performance_, not about good usage of structs, and this question is concerned with _performance_. – Cornelius Feb 09 '11 at 07:52
  • I'm not debating the rest of your post, I'm just saying that your recommendation is flat out wrong and makes absolutely no sense. – Ed S. Feb 09 '11 at 07:55
  • @Ed S. I have edited the answer to make it clear this is only if _performance_ is key (as the OP is concerned about that) – Cornelius Feb 09 '11 at 08:08
  • Still though, you can't just say that "performance is worse if your struct is larger than a reference". It depends on the situation. For example, structs can give you more deterministic performance in space, but if you are passing them around a lot you pay a penalty for the copies that are created. You can't boil it down to a statement like "if your structs are larger than a reference you take a performance hit". It's not that simple, and the first statement is still wrong. – Ed S. Feb 09 '11 at 09:05

I tried running it through a profiler, but I'm not trusting the results I got. (Run1 takes less time than Run2 in it.) So there aren't any concrete answers there, but my suspicion is that the valid[] array is the culprit:

  1. That's a potentially large memory allocation that Run1 is doing and Run2 isn't. Allocating big chunks of memory can be time-consuming.

  2. It's possible that array is ending up far from any other working data in physical memory. At the very least, it's big enough to end up in the large object heap, whereas it looks like almost everything else will end up on the stack or the small object heap. That might mean the Score1 function has to deal with more cache misses than the Score2 function.

It might be a much smaller issue in the serial code, where you've only got that happening once at any given time. When it's happening for a lot of threads simultaneously, though, the problem might compound so that what originally just caused an extra cache miss or two is now causing page faults.
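One way to test that suspicion (a hypothetical variation on the question's Score1, not something from the post) is to hoist the allocation out and reuse the array across runs:

// Sketch: reuse one ValidWord[] per Program instance instead of
// allocating a fresh ~14,000-element array on every call.
public void Score1Reused(string[] words)
{
    if (valid == null || valid.Length < words.Length)
        valid = new ValidWord[words.Length];  // allocate once, then reuse
    else
        Array.Clear(valid, 0, valid.Length);  // reset stale entries instead of reallocating

    for (int i = 0; i < words.Length; i++)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in words[i])
        {
            if (c != 'U')
                builder.Append(c);
        }
        if (words[i].Length == 3)
            valid[i] = new ValidWord { word = builder.ToString(), score = words[i].Length };
    }
}

If Run 1's times move toward Run 2's with this change, the per-call array allocation (and the large object heap traffic it implies) is the likely culprit.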

Sean U
  • I had some problems profiling the original program as well. It was pointing to a different routine as the bottleneck. I appreciate the idea of possible page faults. – jlim Mar 08 '11 at 05:38

So there is a post on CodeProject that helps answer that:

http://www.codeproject.com/KB/cs/foreach.aspx

There you will see that the generated code is slightly different, so over a long list you will lose a few cycles for those extra lines, and that will change the final time.
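For reference, these are the two loop shapes being compared (my sketch, reusing the character filter from the question); on a string the compiler turns foreach into an indexed loop, so the difference is only a few instructions in the generated IL:

using System;

class LoopShapes
{
    static void Main()
    {
        string word = "QUEUE";

        // foreach: the compiler generates an indexed loop over the string
        int kept1 = 0;
        foreach (char c in word)
            if (c != 'U') kept1++;

        // for: the explicit equivalent; the generated code differs only
        // by a few instructions, which is where the extra cycles go
        int kept2 = 0;
        for (int j = 0; j < word.Length; j++)
            if (word[j] != 'U') kept2++;

        Console.WriteLine("{0} {1}", kept1, kept2);
    }
}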

Oakcool

Well, I've just browsed over your code and my first thought is allocation time. In your Score1 you perform a new memory allocation on every iteration:

valid[i] = new ValidWord 

This in turn makes the application find memory and then either initialise it in place, or create a new memory block, set up the values, and copy the newly created block over to the original location (I forget which; that's not the point, though).

The point I'm trying to make is that you are now requiring the application to perform 14,000 memory read/write operations, each of which takes some amount of (micro)seconds. And if new memory is being allocated, it needs to find correctly sized memory sections.

Code performance analysis is a rather wide topic, and I guess only embedded programmers truly utilise it on a daily basis. Just remember that every statement you write has operations linked to it. Reading a Vector<bool> versus a Vector<int>, for instance: the bool will have a slower read time, because the container needs to split memory into bits and then extract a value, whereas the int can be retrieved in bigger chunks of memory.
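As a rough C# analogue of that point (my illustration, not part of the original answer), BitArray packs flags into bits the way a C++ vector<bool> does, while an int[] element is a single aligned load:

using System;
using System.Collections;

class PackedReadDemo
{
    static void Main()
    {
        var bits = new BitArray(1024); // packed: 32 flags per underlying int
        var ints = new int[1024];      // one 4-byte element per slot

        // Reading bits[i] indexes the backing array, then shifts and
        // masks to extract a single bit; reading ints[i] is one load.
        bool flag = bits[42];
        int value = ints[42];
        Console.WriteLine("{0} {1}", flag, value);
    }
}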

Well, that's my 2 cents' worth; I hope it gives you a better idea of what to look for. I have a good book at home covering how to go about analysing your code line by line and what processing time it will use. I will see if I can get hold of it (recently moved) and update the name for you.

Neville

What is the goal of the program? Score1 and Score2 don't tell us anything about what the algorithms are trying to do. It looks like it's treating any word that is three letters, with all capital 'U's removed, as a valid word to be added to the list?

What is the point of calling Parallel.ForEach on a bunch of Program instances when each one is passed the exact same input? And creating a StringBuilder for every word is not a good approach; you want to minimize `new` calls in performance-critical areas in order to reduce the number of times the GC has to kick in.

I ran your program on the text file: http://introcs.cs.princeton.edu/data/words.txt

  • Run 1 Total Average Time = 0.160149 sec
  • Run 2 Total Average Time = 0.056846 sec

Running it under the VS 2010 Sampling profiler shows that Report1 is roughly 78x slower than Report2 and accounts for most of the difference. Mainly due to all the string.Format and Append calls.

Score1 and Score2 are roughly the same in speed with Score1 going slightly slower because of additional time in StringBuilder.ctor and clr.dll.

But I suspect your algorithm could be rewritten without all the string builders or allocations to be an order of magnitude faster.
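For instance, here is a sketch of what such a rewrite might look like (a guess, untested against your data): build the output directly into the per-instance StringBuilder, with no per-word StringBuilder, string.Format, or intermediate string allocations:

// Hypothetical Score3: filter and append in one pass, allocating
// nothing per word. Append(char) and Append(int) replace the
// string.Format and builder.ToString() calls.
public void Score3(string[] words)
{
    output = new StringBuilder();
    total = 0;
    foreach (string w in words)
    {
        if (w.Length != 3)
            continue;          // only three-letter words are scored
        foreach (char c in w)
            if (c != 'U')
                output.Append(c);
        output.Append(' ');
        output.Append(w.Length);
        output.AppendLine();
        total += w.Length;
    }
}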

BrandonAGr
  • I removed the input that was different to the score routines to simplify the posted code. +1 for taking the time to profile it. I was attempting to profile before posting but had a difficult time. The profiler was showing a different spot as a bottleneck. – jlim Mar 08 '11 at 05:36

First of all, you are parallelizing the repeated runs. This will improve your benchmark time, but may not help out your real production run time very much. To accurately measure how long it will take to actually run through one word list, you need to have exactly one word list going at a time. Otherwise, the individual threads processing all the lists are competing with each other to some extent for system resources and the time per list suffers, even if the time to do all the lists in total is faster.

To speed up the time to process one word list, you want to process the individual words in the list in parallel, for exactly one list at a time. To get enough definition/size for a good measurement, either make the list very long or process the list many times in serial.

In your case this gets a bit tricky, because the StringBuilder needed for your final product is not documented as being thread-safe. It's not that bad, though. Here's an example of calling Parallel.ForEach for a single word list:

var locker = new Object(); // I'd actually make this static, but it should end up in a closure and so still work
var outputBuffer = new StringBuilder(); // you can improve things further with a good estimate of the final size, forcing it to allocate all the memory it will need up front
int score = 0;
Parallel.ForEach(words, w =>
{
    // We want to push as much of the work to the individual threads as possible.
    // Run in 1 thread, a StringBuilder per word would be bad.
    // Run in parallel, it allows us to do a little more of the work outside of locked code.
    var buf = new StringBuilder(w.Length + 5);
    string line = buf.Append(w.Where(c => c != 'U').ToArray())
                     .Append(' ').Append(w.Length).ToString();

    lock (locker)
    {
        outputBuffer.AppendLine(line);
        score += w.Length;
    }
});
outputBuffer.Append("Total = ").Append(score);

Just call that 20 times in a normal, sequentially processed for loop. Again, it might finish the benchmarks a little slower, but I think it will perform a little faster in the real world because of a flaw in your benchmark. Also note that I typed this right into the reply window; I've never even tried to compile it, so it's not likely to be perfect right out of the gate.
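A driver for that might look like this (again untested; ProcessOneList here is just a placeholder for the Parallel.ForEach body above wrapped in a method): the repetitions stay serial, and the parallelism lives only inside the per-list work:

var stopwatch = new Stopwatch();
double averageTime = 0.0;
for (int i = 0; i < TIMING_REPETITIONS; ++i)
{
    stopwatch.Restart();            // Reset + Start in one call
    ProcessOneList(words);          // the parallel per-list processing above
    stopwatch.Stop();
    averageTime += stopwatch.Elapsed.TotalSeconds;
}
averageTime /= TIMING_REPETITIONS;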

After fixing your benchmark to more accurately reflect how parallel code will impact your real-world processing time, the next step is to do some profiling to see where your program is actually spending its time. That is how you know which areas to look at for improvement.

Out of curiosity, I'd also like to know how this version performs:

var agg = new { score = 0, OutputBuffer = new StringBuilder() };
agg = words.Where(w => w.Length == 3)
    .Select(w => new string(w.Where(c => c != 'U').ToArray()))
    .Aggregate(agg, (a, w) =>
    {
        a.OutputBuffer.AppendFormat("{0} {1}\n", w, w.Length);
        return new { score = a.score + w.Length, a.OutputBuffer };
    });
agg.OutputBuffer.Append("Total = ").Append(agg.score);
Joel Coehoorn

Just an idea: I haven't made any measurements, but take for instance this line:

foreach (char c in words[i])
  1. I think it would be better to create a temporary variable for the current word.

  2. Also, the iterator of a string might be slower.

The code would become something like this:

var currentWord = words[i];
for (int j = 0; j < currentWord.Length; j++)
{
    char c = currentWord[j]; // index with j, not i
    // ...
}

The `new` might also be a performance issue, as someone already pointed out. Like I said in my comment, adding more profiling data would help to pinpoint exactly what's going on. Or have a look at the generated code.

Sauleil
  • -1, the `foreach` version is optimized specifically by the compiler to not calculate `Length` over and over. Lots of "I think"s and not enough "I tested" make for a poor reply. – Blindy Feb 16 '11 at 20:47
  • Agreed. Guesses about which optimizations the compiler will or will not perform are completely useless. – Ed S. Feb 16 '11 at 22:23
  • I don't know why people downvote so much... If you do some searches, you will see that the IL code generated by the foreach and the for is different (http://abi.exdream.com/Blog/post/2009/04/22/For-vs-Foreach-Performance.aspx; even Oakcool posted an interesting link on the matter). Also, I'm just giving you some ideas to help you investigate the matter. I won't do the profiling in your place. – Sauleil Feb 28 '11 at 01:17