10

I have this simple loop:

int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

I compared its performance with the equivalent C++ version. I thought that the performance should be nearly the same, because it is very simple code and the range checks are omitted in that case. But it turned out that the C++ version is almost three times faster. So I implemented an unsafe C# version, but the performance was even worse. ReSharper suggests converting the loop to a LINQ expression like this:

sum = array.Sum();

That code is many times slower than the original C# loop.

Could someone tell me whether there is anything more I can do to improve the performance of this loop (without compiling it as a 64-bit version, which is two times faster)?

All tests were made on a 32-bit Release build and run without the debugger.

Edit: Small correction. The 64-bit version is two times faster with doubles, not ints.

userx01233433
  • You could use multiple threads to add those numbers – Anirudha Oct 13 '13 at 16:06
  • What happens if you mark it unchecked? – Silvermind Oct 13 '13 at 16:15
  • 2
    Why would 64bits be 2 times faster? – Deblaton Jean-Philippe Oct 13 '13 at 16:15
  • @patxy: [link](http://stackoverflow.com/questions/5738448/is-float-slower-than-double-does-64-bit-program-run-faster-than-32-bit-program) – userx01233433 Oct 13 '13 at 16:21
  • 1
    btw, c++ is totally compiled. C# is pre-compiled, then interpreted by the clr. – Deblaton Jean-Philippe Oct 13 '13 at 16:21
  • 1
@Silvermind: I think 'unchecked' has no influence on doubles, only on integers. I also think that the compiler's default option is 'unchecked'. – userx01233433 Oct 13 '13 at 16:26
  • 5
    I'm clocking it at 310 vs 420 msec. Not bad at all, considering my C++ compiler auto-parallelizes the loop. You are probably making the traditional benchmarking mistakes like measuring jitting time, running the debug build or having a debugger attached so the optimizer is disabled. – Hans Passant Oct 13 '13 at 16:34
  • 1
    Are you sure this is the bottleneck in your application? – myermian Oct 13 '13 at 16:36
  • The reason why Linq extension `.Sum()` is much slower *could* really be that this method actually does check for overflows, hence is comparable to a `checked` explicit loop (`for` or `foreach`). – Jeppe Stig Nielsen Oct 13 '13 at 19:59
@patxy: Sorry, I was thinking about the array of doubles that I was testing at the time. And you are right, the 64-bit version shouldn't be faster with an array of ints. – userx01233433 Oct 13 '13 at 20:37
  • @Hans: You're talking about spinning up additional threads, or you meant auto-vectorization (SIMD)? Loop unrolling is also a nice optimization that C++ compilers are likely to use here. – Ben Voigt Oct 13 '13 at 21:39
  • @patxy The code is most definitely **NOT** interpreted by the CLR. JIT compilation is not interpretation. – luiscubal Oct 13 '13 at 23:11

6 Answers

15
using System;
using System.Diagnostics;
using System.Linq;

var watch = new Stopwatch();

int[] array = new int[100000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.Sum();
Console.WriteLine("linq sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
int length = array.Length;
for (int i = 0; i < length; i++)
    sum += array[i];
Console.WriteLine("for loop fixed:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
foreach (int i in array)
{
    sum += i;
}
Console.WriteLine("foreach sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

sum = 0;
watch.Restart();
sum = array.AsParallel().Sum();
Console.WriteLine("linq parallel sum:" + watch.ElapsedMilliseconds + "ms, result:" + sum);

LINQ Parallel seems to be fastest, on my machine at least.

Fixing the length in a local variable doesn't matter much, but it improves things by roughly 10%.

There is not much more you can actually do; unmanaged C code will always be faster for this.

Results on my PC are:

for loop:      241ms, result:100000000
linq sum:      559ms, result:100000000
for loop fixed:237ms, result:100000000
foreach sum:   295ms, result:100000000
linq parallel: 205ms, result:100000000
MichaC
  • Can you post the results? – Magnus Oct 13 '13 at 16:16
Thanks. It is a little sad that such a simple loop cannot run as fast as unmanaged code, especially since at that point the GC is not involved and no range checks are done. – userx01233433 Oct 13 '13 at 16:36
  • I do not see the same relative performance as you: for loop:78ms; linq sum:751ms; for loop fixed:76ms; foreach sum:77ms; linq parallel sum:183ms. That's using an i7 920 with hyperthreading enabled. For the OP's 100000000 elements, it does not even bother spinning up every available core; doubling the number of elements does. Also, I factored out the `watch.ElapsedMilliseconds` as that had an effect on the reported timings. – Andrew Morton Oct 13 '13 at 16:41
Interestingly, the first time I ran your code (on LINQPad), the time for linq parallel came out well above 1000ms. But on the 2nd run it came out at 336ms. Any ideas why? – th1rdey3 Oct 13 '13 at 16:50
  • 3
Running this test in debug/release makes a lot of difference. For loop: 430/49 ms, Linq sum: 724/714 ms, For loop fixed: 393/48 ms, Foreach sum: 492/49 ms, Linq parallel: 168/148 ms. Remember to check your code in release mode when testing. – Casperah Oct 13 '13 at 17:07
@Casperah right, it also makes a difference if you compile it for x64 or x86... Btw. for some reason the linq parallel query runs 50% faster when I put it into an unsafe block... ~~ – MichaC Oct 13 '13 at 17:16
  • 2
IMHO, this doesn't answer the basic question of comparative speed, but it does speed up the C#. In a fair comparison we would rewrite the C++ version to be parallel as well, and then find out that the C# code is still two times slower than the now-equivalent C++ loop. – Xantix Oct 13 '13 at 18:28
10

Unroll the loop 2-8 times and measure which factor is best. The .NET JIT optimizes poorly, so you have to do some of its work for it.

You'll probably have to add unsafe as well because the JIT will now be unable to optimize out the array bounds checks.
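
For example, 4x unrolling with a single accumulator could look like the following sketch (it assumes array.Length is a multiple of 4, otherwise a scalar tail loop is needed; per the above, a fixed-pointer version may be required to get rid of the bounds checks):

// one loop-condition check per four additions
int sum = 0;
for (int i = 0; i < array.Length; i += 4)
    sum += array[i] + array[i + 1] + array[i + 2] + array[i + 3];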

You can also try to aggregate into multiple sum variables:

int sum1 = 0, sum2 = 0;
for (int i = 0; i < array.Length; i+=2) {
    sum1 += array[i+0];
    sum2 += array[i+1];
}

That might increase instruction-level parallelism because all add instructions are now independent.

The i+0 is optimized to i automatically.


I tested it and it shaved off about 30%.

The timings are stable when repeated. Code:

// note: the fixed statements below require an unsafe context (e.g. compile with /unsafe)
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;

var watch = new Stopwatch();

int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

//warmup
{
    watch.Restart();
    int sum = 0;
    for (int i = 0; i < array.Length; i++)
        sum += array[i];
}

for (int i2 = 0; i2 < 5; i2++)
{
    {
        watch.Restart();
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];
        Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
    }

    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum = 0;
            var length = array.Length;
            for (int i = 0; i < length; i++)
                sum += ptr[i];
            Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
        }
    }

    {
        watch.Restart();
        fixed (int* ptr = array)
        {
            int sum1 = 0;
            int sum2 = 0;
            int sum3 = 0;
            int sum4 = 0;
            var length = array.Length;
            for (int i = 0; i < length; i += 4)
            {
                sum1 += ptr[i + 0];
                sum2 += ptr[i + 1];
                sum3 += ptr[i + 2];
                sum4 += ptr[i + 3];
            }
            Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + (sum1 + sum2 + sum3 + sum4));
        }
    }

    Console.WriteLine("===");
}

Playing around further, it turns out that multiple aggregation variables do nothing. Unrolling the loop made a major improvement, though. Unsafe did nothing (except in the unrolled case, where it is pretty much required). Unrolling 2 times is as good as 4.

Running this on a Core i7.

usr
  • 1
You mentioned that the .NET JIT optimizes poorly; I was interested and implemented a [small benchmark in java](http://pastebin.com/azXeCn9W) to compare. It runs way faster than the C# solution on my laptop. Is Java really optimizing **that** much better? I ran all tests `100` times in a loop. The parallel linq solution was the fastest with _min: 158ms_, compared to my fastest Java `RecursiveTask` implementation with _min: 21ms_. The simple array loop in java took only _min: 43ms_. – Ortwin Angermeier Oct 13 '13 at 19:15
  • 2
    I'd take parallelism out of the equation. This is an issue orthogonal to JIT quality. Only measure sequential, maximally optimized code. – usr Oct 13 '13 at 19:29
  • Your average Java JVM optimization is leaps and bounds ahead of the .NET CLR. It's unfortunate, because the C# language has so many other pros. – Special Sauce Dec 18 '15 at 21:46
7

First a couple of general remarks about micro benchmarks like this:

  • The timings here are so short that JIT time can be significant. This is important because a parallel ForEach loop contains an anonymous delegate that is only JITted when it is first called and so the JIT time is included in the timing the first time the benchmark is run.
  • The context of the code is important as well. The JITter can do a better job optimizing small functions. Isolating the benchmark code in its own function may have a significant impact on performance.

There are four basic techniques for speeding up the code (if we keep it pure CLR):

  1. Parallelize it. This is obvious.
  2. Unroll loops. This reduces the number of instructions by only doing a comparison every 2 or more iterations.
  3. Use unsafe code. This is not much of a benefit in this case, because the main issue (the range checks on the array) is optimized away anyway.
  4. Allow the code to be optimized better. We can do this by putting the actual benchmark code in a separate method (see the sketch after this list).
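
Combining unrolling (2) with a separate method (4) might look like the following sketch (it assumes array.Length is a multiple of 4; the method name SumUnrolled is purely illustrative):

static int SumUnrolled(int[] array)
{
    // a small, isolated method is easier for the JITter to optimize
    int sum = 0;
    for (int i = 0; i < array.Length; i += 4)
        sum += array[i] + array[i + 1] + array[i + 2] + array[i + 3];
    return sum;
}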

Here is the parallel code:

int sum = 0;
var syncObj = new object();
Parallel.ForEach(Partitioner.Create(0, array.Length),
    () => 0,
    (src, state, partialSum) => {
        int end = src.Item2;
        for (int i = src.Item1; i < end; i++)
            partialSum += array[i];
        return partialSum;
    },
    partialSum => { lock (syncObj) { sum += partialSum; } });

The Partitioner class lives in the System.Collections.Concurrent namespace, and Parallel lives in System.Threading.Tasks.

On my machine (i7 950, 8 logical cores), the timings I got were:

For loop: 196.786 ms
For loop (separate method): 72.319 ms
Unrolled for loop: 196.167 ms
Unrolled for loop (separate method): 67.961 ms
Parallel.Foreach (1st time): 48.243 ms
Parallel.Foreach (2nd time): 26.356 ms

There was no significant difference between 32 bit and 64 bit code.

Jeffrey Sax
0

I've added the following to @Ela's answer:

sum = 0;
watch.Restart();
var _lock = new object();
// note: this assumes array.Length is evenly divisible by 16,
// otherwise the last few elements are skipped
var stepsize = array.Length / 16;
Parallel.For(0, 16,
    (x, y) =>
    {
        var sumPartial = 0;
        for (var i = x * stepsize; i != (x + 1) * stepsize; ++i)
            sumPartial += array[i];
        lock (_lock)
            sum += sumPartial;
    });
Console.Write("Parallel.For:" + watch.ElapsedMilliseconds + " ms, result:" + sum);

and then printed the results so you get reference values:

for loop:893ms, result:100000000
linq sum:1535ms, result:100000000
for loop fixed:720ms, result:100000000
foreach sum:772ms, result:100000000
Parallel.For:195 ms, result:100000000

As you can see, waaay faster :) For the step size, I tried arr.Length / 8, arr.Length / 16, and arr.Length / 32 (I have an i7 CPU: 4 cores * 2 threads = 8 hardware threads), and they were all pretty much the same, so it's your choice.

Edit: I also tried stepsize = arr.Length / 100, which came in at around 250ms, so a little bit slower.

Nefarion
0

Using immediate operands will improve the performance to some extent:

int[] array = new int[100000000];
int sum = 0;
for (int i = 0; i < array.Length; i++)
    sum += array[i];

The above code needs to access two memory locations in the loop condition, i.e. `i` and `array.Length`.

Instead, use:

int[] array = new int[100000000];
int sum = 0;
int arrayLength = array.Length;
for (int i = arrayLength - 1; i >= 0; i--)
    sum += array[i];
  • 1
According to this msdn [blog](http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx), the range check is still performed for the descending version of the loop. – userx01233433 Oct 14 '13 at 07:31
Also, I think that these two variables will be cached in registers. – userx01233433 Oct 14 '13 at 07:32
0

An easy and sometimes significant C# `for` loop optimization that is often overlooked is switching the loop counter variable type from int to uint. This leads to about a 12% speedup on average for your standard i++ (increment) loops with millions of iterations. If your loop iterates fewer times than that, it's probably not going to change performance much.

Note that arrays can be indexed by uint without casting to int so you don't lose any of the benefits when indexing inside the loop. The only common reasons to not use this technique are if you need negative loop counter values, or if the loop counter variable needs to be cast to int for other function calls, etc. inside the loop. As soon as you need to cast, it's probably not worth it.
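
For example, the change is just (a minimal sketch; `array` is the array from the question, and the numbers above are this answer's own measurements):

int sum = 0;
// a uint counter can never be negative, and arrays accept a uint
// index directly, so no cast is needed in the loop body
for (uint i = 0; i < (uint)array.Length; i++)
    sum += array[i];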

Special Sauce