
Why is it that when I turn on "Check for arithmetic overflow/underflow" under C# Project Properties > Build > Advanced, the following code runs faster (138 ms) than when I turn the option off (141 ms)?

Test Run #1: 138ms with checked arithmetic, 141ms without

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var s = new Stopwatch();
        s.Start();
        int a = 0;
        for (int i = 0; i < 100000000; i += 3) {
            if (i == 1000)
                i *= 2;
            if (i % 35 == 0)
                ++a;
        }
        s.Stop();
        Console.WriteLine(s.ElapsedMilliseconds);
        Console.WriteLine(a);
    }
}

On the other hand, if you comment out the `if (i == 1000) i *= 2;` line, then the checked code runs slower (120 ms) than the unchecked code (116 ms).

Test Run #2: 120ms with checked arithmetic, 116ms without

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var s = new Stopwatch();
        s.Start();
        int a = 0;
        for (int i = 0; i < 100000000; i += 3) {
            if (i % 35 == 0)
                ++a;
        }
        s.Stop();
        Console.WriteLine(s.ElapsedMilliseconds);
        Console.WriteLine(a);
    }
}

Process: Manually ran the .exe outside Visual Studio from a PowerShell prompt repeatedly until the reported time was consistent (±1 ms); flipped between the two settings multiple times to ensure consistent results.

Test Box Setup:

  • Windows 8.1 Pro x64
  • VS2013 Update 2
  • Intel Core i7-4500
  • default C# Console project template
  • Release configuration
Edward Brey
  • 40,302
  • 20
  • 199
  • 253
  • 9
    Over how many runs? Sounds more likely to be a matter of just general benchmarking fuzziness than anything else. – Jon Skeet May 19 '14 at 11:40
  • The difference might just be due to the insufficient SNR of the measurement. – Jon May 19 '14 at 11:40
  • 5
    *sometimes*? Micro-benchmarking isn't easy. And a 3ms difference doesn't seem significant. – dcastro May 19 '14 at 11:42
  • @JonSkeet: I used a simple, but I believe reliable, benchmarking approach. I didn't go into detail about it in the question, because I didn't want to distract from the core curiosity. I figured others could quickly independently verify the behavior using whatever method they liked. But for the record, I wrapped the above code in a simple `var s = new Stopwatch(); s.Start();` and `s.Stop();`, built it in release mode and ran it a bunch of times from the console. I did that for each overflow setting, going back and forth between the two to confirm the results were consistent. – Edward Brey May 19 '14 at 11:49
  • I can reproduce the test, but there is only a difference of 2 ms (executed 4 runs, each executing the above code 100 times, release mode). I'm not sure this is a significant difference, though. – ken2k May 19 '14 at 11:57
  • @GeorgeStocker: Fair enough. I added all the relevant benchmark details I could think of to the question. – Edward Brey May 19 '14 at 12:05
  • From my experiments, checked arithmetic was horribly slower. – leppie May 19 '14 at 12:09
  • It could also be the JIT that is clever enough to see that `i` can never overflow/underflow in your loop. – leppie May 19 '14 at 12:11
  • @GeorgeStocker: Another good point. I updated the question to link to a gist with the full source. – Edward Brey May 19 '14 at 12:14
  • While I still believe the difference is too small to take into account, I think it can be related to branch prediction, as described here: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Lanorkin May 19 '14 at 12:15
  • 3
    @EdwardBrey Your code *should be in the question itself*. It should never be only accessible through a link. I edited your question to show you what it should look like. If you're going to ask us a benchmarking question, *this should be how it should be set up*. – George Stocker May 19 '14 at 12:24
  • The answer is that you are dealing with many constants, which allows the JIT to make the safe assumption that `i` will never overflow. If you use something like a Fibonacci benchmark (https://gist.github.com/leppie/57d9ca297890fa6dd22f), the difference becomes clear. 2770 ms vs 4150 ms (AnyCPU, 32-bit preferred) – leppie May 19 '14 at 12:30
  • @GeorgeStocker: My personal preference is a smaller snippet that focuses on the code with interesting performance characteristics, plus a pointer to a gist with the full code that is easy to Ctrl+A, Ctrl+C. But I'll assume that you're coming from a community-consensus best practice, so I'm fine with how you've reformatted the question. – Edward Brey May 19 '14 at 12:38
  • 1
    The speed difference is too small, your code doesn't perform a checked computation enough. The signal is there but it drowns in the noise, modern processor have very non-deterministic code execution time. You can see it by repeating the test at 20 times with an outer for(;;) loop, the unchecked version is ~6% faster for x64, ~2% for x86. Note how the cost of jitting is visible in the first pass through the loop. Perf differences less than 15% are not meaningful, a simple misaligned branch is enough. – Hans Passant May 19 '14 at 13:02
  • @HansPassant: I couldn't repro your distinction with an outer for(;;) loop. In [my test](https://gist.github.com/breyed/44861b168bcb57e6dc63), checked was coming in at 2784 ms, whereas unchecked was coming in at 3023 ms. However, if a misaligned branch is indeed the culprit, then the details of the test setup would make all the difference. – Edward Brey May 19 '14 at 14:32
  • Proper benchmarking is a fine art. If you measure this code at 3 seconds, then you are not close to getting it right; few programmers have a machine that's 20 times slower than my pokey laptop. Leppie's example ought to be a lot more resilient against major mistakes. – Hans Passant May 19 '14 at 14:45
  • @HansPassant: The 3-second duration was a result of attempting to repro the variation you described in your comment. Your results came in around 3 seconds as well, correct? Since the results across multiple runs were very tight compared with the difference between checked and unchecked, I doubt that the length of the test run matters in this case. – Edward Brey May 19 '14 at 14:50
  • You are not doing it right; each pass through the outer loop should make its own measurement. That is the only way to eliminate transient overhead from the measurement. Just put the for(;;) loop around the body of your current Main() method (a sketch of this follows the comments). – Hans Passant May 19 '14 at 14:53
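
Below is a minimal sketch of the per-pass measurement Hans Passant describes. The pass count of 20 is an assumption taken from his earlier comment, and the loop body is the code from Test Run #1.

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        // Repeat the whole measurement; jitting and other transient overhead
        // show up mainly in the first pass, so compare the later passes.
        for (int pass = 0; pass < 20; pass++) {
            var s = new Stopwatch();
            s.Start();
            int a = 0;
            for (int i = 0; i < 100000000; i += 3) {
                if (i == 1000)
                    i *= 2;
                if (i % 35 == 0)
                    ++a;
            }
            s.Stop();
            // One line per pass.
            Console.WriteLine("{0} ms (a = {1})", s.ElapsedMilliseconds, a);
        }
    }
}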

2 Answers

8

The answer is that you are dealing with many constants, which allows the JIT to make the safe assumption that `i` will never overflow. If you use something like a Fibonacci benchmark, the difference becomes clear.

2770 ms vs 4150 ms (AnyCPU, 32-bit preferred)

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var s = new Stopwatch();
        s.Start();
        int a = 0;
        for (int i = 0; i < 100000000; i++)
        {
            a = Fibonacci(45);
        }
        s.Stop();
        Console.WriteLine(s.ElapsedMilliseconds);
    }

    public static int Fibonacci(int n)
    {
        int a = 0;
        int b = 1;
        for (int i = 0; i < n; i++)
        {
            int temp = a;
            a = b;
            // if the JIT compiler is clever, only this one needs to be 'checked'
            b = temp + b; 
        }
        return a;
    }

}
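
As an aside on the comment in the loop above: if only that single addition needs checking, the intent can be expressed explicitly with C#'s `checked` expression instead of the project-wide option. A minimal sketch (the method name is invented; the body is otherwise identical to the one above):

public static int FibonacciExplicitlyChecked(int n)
{
    int a = 0;
    int b = 1;
    for (int i = 0; i < n; i++)
    {
        int temp = a;
        a = b;
        // Only this addition is overflow-checked, even when the project-wide
        // "Check for arithmetic overflow/underflow" option is off.
        b = checked(temp + b);
    }
    return a;
}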
leppie
  • 115,091
  • 17
  • 196
  • 297
  • 1
    Note: There are quite big differences between what CLR version is used and what bitness is forced. It would seem the latest .NET updates has indeed improved the performance. I was 'forced' to use intermediate longs in IronScheme due to tests I ran a few years back. It would seem checked arithmetic has become fast enough to justify using it. \o/ – leppie May 19 '14 at 12:52
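
For context on the "intermediate longs" workaround mentioned above, here is a rough sketch of the general technique of widening to 64-bit and detecting 32-bit overflow manually instead of relying on checked arithmetic. The class and method names are invented, and the actual IronScheme code may well differ.

using System;

static class IntermediateLongMath
{
    // Add two ints via a 64-bit intermediate and detect overflow manually,
    // instead of using a checked 32-bit add.
    public static int Add(int x, int y)
    {
        long sum = (long)x + y;            // 64-bit addition cannot overflow here
        if (sum < int.MinValue || sum > int.MaxValue)
            throw new OverflowException(); // or fall back to a wider representation
        return (int)sum;
    }
}
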
4

In this case, checked arithmetic is faster than unchecked for two reasons:

  • The JIT compiler was able to determine that checking most of the arithmetic was unnecessary, so the added overhead for the checks was slight. This was an artifact of the simplicity of the test; leppie's answer gives a good example of an algorithm with a substantial perf difference.

  • The machine instructions inserted to perform the check happened to cause a key branch destination to fall on an alignment boundary. You can see this in two ways:

    • Replace `int a = 0;` with `int a = args.Length;`. Run the tests and observe that the performance inversion is gone. The reason is that the additional code causes the branch destination to be aligned.

    • Examine the assembly from the original test below. I obtained it by adding `Process.EnterDebugMode();` and `Debugger.Break();` to the end of `Main` (a sketch of that instrumentation follows this list) and running the Release-mode .exe from the command line. Notice how, when the checked code tests `i % 35 == 0` and the result is false, it branches to 00B700CA, which is an aligned instruction. Contrast that with the unchecked code, which branches to 012D00C3. Even though the checked code has an additional `jo` instruction, its cost is outweighed by the savings of the aligned branch.
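
For reference, a minimal sketch of that instrumentation, assuming the break is used to attach a debugger to the already-jitted Release-mode process and inspect its disassembly:

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var s = new Stopwatch();
        s.Start();
        int a = 0;
        for (int i = 0; i < 100000000; i += 3) {
            if (i == 1000)
                i *= 2;
            if (i % 35 == 0)
                ++a;
        }
        s.Stop();
        Console.WriteLine(s.ElapsedMilliseconds);
        Console.WriteLine(a);

        // Added only to capture the optimized disassembly: Break() stops the
        // Release-mode process after the JIT has already compiled and run the
        // loop, so a debugger can be attached without disabling optimizations.
        Process.EnterDebugMode();
        Debugger.Break();
    }
}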

Checked

        int a = 0;
00B700A6  xor         ebx,ebx  
        for (int i = 0; i < 100000000; i += 3) {
00B700A8  xor         esi,esi  
            if (i == 1000)
00B700AA  cmp         esi,3E8h  
00B700B0  jne         00B700B7  
                i *= 2;
00B700B2  mov         esi,7D0h  
            if (i % 35 == 0)
00B700B7  mov         eax,esi  
00B700B9  mov         ecx,23h  
00B700BE  cdq  
00B700BF  idiv        eax,ecx  
00B700C1  test        edx,edx  
00B700C3  jne         00B700CA  
                ++a;
00B700C5  add         ebx,1  
00B700C8  jo          00B70128  
        for (int i = 0; i < 100000000; i += 3) {
00B700CA  add         esi,3  
00B700CD  jo          00B70128  
00B700CF  cmp         esi,5F5E100h  
00B700D5  jl          00B700AA  
        }

Unchecked

        int a = 0;
012D00A6  xor         ebx,ebx  
        for (int i = 0; i < 100000000; i += 3) {
012D00A8  xor         esi,esi  
            if (i == 1000)
012D00AA  cmp         esi,3E8h  
012D00B0  jne         012D00B4  
                i *= 2;
012D00B2  add         esi,esi  
            if (i % 35 == 0)
012D00B4  mov         eax,esi  
012D00B6  mov         ecx,23h  
012D00BB  cdq  
012D00BC  idiv        eax,ecx  
012D00BE  test        edx,edx  
012D00C0  jne         012D00C3  
                ++a;
012D00C2  inc         ebx  
        for (int i = 0; i < 100000000; i += 3) {
012D00C3  add         esi,3  
012D00C6  cmp         esi,5F5E100h  
012D00CC  jl          012D00AA  
        }
Edward Brey
  • 40,302
  • 20
  • 199
  • 253
  • Thanks to Hans Passant for the suggestion to look for branch alignment as the primary cause. – Edward Brey May 19 '14 at 15:26
  • Ironically, the `i == 1000` condition that caused the surprising performance inversion was an attempt to prevent over-simplification while looking at whether checked vs. unchecked made a big perf difference, after having listened to [Crawford on .NET Rocks](http://dotnetrocks.com/default.aspx?showNum=982). – Edward Brey May 19 '14 at 15:27