What does cause different performance of Math.Max in C#?

Question

I ran this on a laptop, 64-bit Windows 8.1, 2.2 Ghz Intel Core i3. The code was compiled in release mode and ran without a debugger attached.

static void Main(string[] args)
    {
        calcMax(new[] { 1, 2 });
        calcMax2(new[] { 1, 2 });

        var A = GetArray(200000000);
        var stopwatch = new Stopwatch();
        stopwatch.Start(); stopwatch.Stop();

        GC.Collect();
        stopwatch.Reset();
        stopwatch.Start();
        calcMax(A);
        stopwatch.Stop();
        Console.WriteLine("caclMax - \t{0}", stopwatch.Elapsed);

        GC.Collect();
        stopwatch.Reset();
        stopwatch.Start();
        calcMax2(A);
        stopwatch.Stop();
        Console.WriteLine("caclMax2 - \t{0}", stopwatch.Elapsed);

        Console.ReadKey();
    }

    static int[] GetArray(int size)
    {
        var r = new Random(size);
        var ret = new int[size];

        for (int i = 0; i < size; i++)
        {
            ret[i] = r.Next();
        }
        return ret;
    }

    static int calcMax(int[] A)
    {
        int max = int.MinValue;
        for (int i = 0; i < A.Length; i++)
        {
            max = Math.Max(max, A[i]);
        }
        return max;
    }

    static int calcMax2(int[] A)
    {
        int max1 = int.MinValue;
        int max2 = int.MinValue;

        for (int i = 0; i < A.Length; i += 2)
        {
            max1 = Math.Max(max1, A[i]);
            max2 = Math.Max(max2, A[i + 1]);
        }
        return Math.Max(max1, max2);
    }

Here are some statistics of program performance (time in miliseconds):

Framework 2.0

X86 platform: 2269 (calcMax) 2971 (calcMax2) [winner calcMax]

X64 platform: 6163 (calcMax) 5916 (calcMax2) [winner calcMax2]

Framework 4.5 (time in miliseconds)

X86 platform: 2109 (calcMax) 2579 (calcMax2) [winner calcMax]

X64 platform: 2040 (calcMax) 2488 (calcMax2) [winner calcMax]

As you can see the performance is different depend on framework and choosen compilied platform. I see generated IL code and it is the same for each cases.

The calcMax2 is under test because it should use "pipelining" of processor. But it is faster only with framework 2.0 on 64-bit platform. So, what is real reason of shown case in different performance?

Did you look whether the implementation changed across .NET versions? This isn't a valid profile either: there's no warmup and you do.. 2 iterations for each approach? — Jeroen Vannevel, Feb 08 '15 at 14:22

score 5 · Accepted Answer · edited May 23 '17 at 11:44

Just some notes worth mentioning. My processor (Haswell i7) doesn't compare well with yours, I certainly can't get close to reproducing the outlier x64 result.

Benchmarking is a hazardous exercise and it is very easy to make simple mistakes that can have big consequences on execution time. You can only truly see them when you look at the generated machine code. Use Tools + Options, Debugging, General and untick the "Suppress JIT optimization" option. That way you can look at the code with Debug > Windows > Disassembly and not affect the optimizer.

Some things you'll see when you do this:

You made a mistake, you are not actually using the method return value. The jitter optimizer opportunities like this where possible, it completely omits the max variable assignment in calcMax(). But not in calcMax2(). This is a classic benchmarking oops, in a real program you'd of course use the return value. This makes calcMax() look too good.
The .NET 4 jitter is smarter about optimizing Math.Max(), in can generate the code inline. The .NET 2 jitter couldn't do that yet, it has to make a call to a CLR helper function. The 4.5 test should thus run a lot faster, that it didn't is a strong hint at what really throttles the code execution. It is not the processor's execution engine, it is the cost of accessing memory. Your array is too large to fit in the processor caches so your program is bogged down waiting for the slow RAM to supply the data. If the processor cannot overlap that with executing instructions then it just stalls.
Noteworthy about calcMax() is what happens to the array-bounds check that C# performs. The jitter knows how to completely eliminate it from the loop. It however isn't smart enough to do the same in calcMax2(), the A[i + 1] screws that up. That check doesn't come for free, it should make calcMax2() quite a bit slower. That it doesn't is again a strong hint that memory is the true bottleneck. That's pretty normal btw, array bound checking in C# can have low to no overhead because it is so much cheaper than the array element access.

As for your basic quest, trying to improve super-scalar execution opportunities, no, that's not how processors work. A loop is not a boundary for the processor, it just sees a different stream of compare and branch instructions, all of which can execute concurrently if they don't have inter-dependencies. What you did by hand is something the optimizer already does itself, an optimization called "loop unrolling". It selected not to do so in this particular case btw. An overview of jitter optimizer strategies is available in this post. Trying to outsmart the processor and the optimizer is a pretty tall order and getting a worse result by trying to help is certainly not unusual.

score 1 · Answer 2 · answered Feb 08 '15 at 14:45

Many of the differences that you see are well within the range of tolerance, so they should be considered as no differences.

Essentially, what these numbers show is that Framework 2.0 was highly unoptimized for X64, (no surprise at all here,) and that overall, calcMax performs slightly better than calcMax2. (No surprise there either, because calcMax2 contains more instructions.)

So, what we learn is that someone came up with a theory that they could achieve better performance by writing high-level code that somehow takes advantage of some pipelining of the CPU, and that this theory was proved wrong.

The running time of your code is dominated by the failed branch predictions that are occurring within Math.max() due to the randomness of your data. Try less randomness (more consecutive values where the 2nd one will always be greater) and see if it gives you any better insights.

score 0 · Answer 3 · answered Feb 08 '15 at 15:14

Every time you run the program, you'll get slightly different results. Sometimes calcMax will win, and sometimes calcMax2 will win. This is because there is a problem comparing performance that way. What StopWhatch measures is the time elapsed since stopwatch.Start() is called, until stopwatch.Stop() is called. In between, things independent of your code can occur. For example, the operating system can take the processor from your process and give it for a while to another process running on your machine, due to the end of your process's time slice. after a while, your process gets the processor back for another time slice. Such occurrences cannot be controlled or foreseen by your comparison code, and thus the entire experiment shouldn't be treated as reliable.

To minimize this kind of measurement errors, you should measure every function many times (for example, 1000 times), and calculate the average time of all measurements. This method of measurement tends to significantly improve the reliability of the result, as it is more resilient to statistical errors.

What does cause different performance of Math.Max in C#?

3 Answers3