
Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
    var isMultipleOf16 = i % 16 == 0;
    count += isMultipleOf16 ? 1 : 0;
    //count += i % 16 == 0 ? 1 : 0;
}

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: [Repro project](https://github.com/breyed/PerfTest), useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980s-era Dragon Book. How did the optimization not get applied when generating the CIL or when JITting the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.
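
One sanity check that seems relevant here is whether the compiler even asked the JIT to optimize: the C# compiler records that request in a DebuggableAttribute on the assembly. A minimal sketch (the helper class is my own, not part of the test app):

using System;
using System.Diagnostics;
using System.Reflection;

static class JitCheck
{
    // Reports whether the compiler asked the JIT to disable optimizations for
    // the given assembly (true for typical Debug builds, false for Release).
    public static void ReportJitOptimization(Assembly assembly)
    {
        var attr = (DebuggableAttribute)Attribute.GetCustomAttribute(
            assembly, typeof(DebuggableAttribute));
        bool optimizerDisabled = attr != null && attr.IsJITOptimizerDisabled;
        Console.WriteLine("JIT optimizer disabled: " + optimizerDisabled);
    }
}

Calling JitCheck.ReportJitOptimization(Assembly.GetExecutingAssembly()) at the start of the timing run should print false for an optimized Release build.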

Update: Speed differences also occur for x86, but depend on the order in which methods are just-in-time compiled. See [Why does JIT order affect performance?](http://stackoverflow.com/questions/10406796/why-does-jit-order-affect-performance)

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
00000037  mov         eax,edx 
00000039  and         eax,0Fh 
0000003c  xor         ecx,ecx 
0000003e  test        eax,eax 
00000040  sete        cl 
    count += isMultipleOf16 ? 1 : 0;
00000043  movzx       eax,cl 
00000046  test        eax,eax 
00000048  jne         0000000000000050 
0000004a  xor         eax,eax 
0000004c  jmp         0000000000000055 
0000004e  xchg        ax,ax 
00000050  mov         eax,1 
00000055  lea         r8d,[rbx+rax] 
    count += i % 16 == 0 ? 1 : 0;
00000037  mov         eax,ecx 
00000039  and         eax,0Fh 
0000003c  je          0000000000000042 
0000003e  xor         eax,eax 
00000040  jmp         0000000000000047 
00000042  mov         eax,1 
00000047  lea         edx,[rbx+rax] 
Edward Brey
  • I'd be curious to see the different assembly code. Could you post it? – phoog Apr 29 '12 at 03:25
  • have you tested bool isMultipleOf16 = ...? – David.Chu.ca Apr 29 '12 at 03:29
  • @David.Chu.ca - that wouldn't make a difference - `var` is "compiler, please infer the type of this variable, and pretend I wrote that instead". In this case, it will have inferred `bool` for itself. – Damien_The_Unbeliever Apr 29 '12 at 03:33
  • how about static isMultipleOf16? – David.Chu.ca Apr 29 '12 at 04:24
  • @EdwardBrey: Since you did this in *Debug* mode all bets are off – BrokenGlass Apr 29 '12 at 05:15
  • The only way to get this code is by having the jitter optimizer disabled. Invalid test. – Hans Passant Apr 29 '12 at 11:34
  • @BrokenGlass You sound rather confident. I was careful to ensure I was in Release mode and that optimizations remained enabled. To verify, I turned optimization off and verified that the timings got slower and the assembly code got larger. I changed the Project Configuration to "Debug" and turned on the "Suppress JIT optimization on module" setting. When I went back to unsuppressed release optimized, the code got smaller and faster again. If I am making an error, is there anything I should take note of to spot it? – Edward Brey Apr 29 '12 at 11:35
  • @Hans I tried two separate computers with the same result. I also reimplemented on multiple projects, because I couldn't believe it was really true. Still, same result. I haven't done anything I know of to disable the JIT optimizers on my systems, although I don't know where to look for this. Can you repro? – Edward Brey Apr 29 '12 at 11:43
  • Sure, I can get the same code you posted by turning on the "Suppress JIT optimization" debugger option. The optimized code is *very* different. – Hans Passant Apr 29 '12 at 11:47
  • @Hans The 13 lines of assembly code above are with "Suppress JIT optimization" turned off. If I turn suppression on, it gets even worse and balloons to 21 lines of assembly code. – Edward Brey Apr 29 '12 at 12:25
  • I cannot repro your result with the exact same OS and VS+.NET versions, no idea why you are getting unoptimized code. – Hans Passant Apr 29 '12 at 12:31
  • @Hans Let's try to narrow this down. I put a [repro on GitHub](https://github.com/breyed/PerfTest). What do you get for results? – Edward Brey Apr 29 '12 at 13:50
  • @EdwardBrey: I can't find a source at the moment, but I believe the jitter and/or other optimizer settings are different if you have a debugger attached *at all* (that is, if you're running from Visual Studio, even if you compiled in "Release" mode). Try running your code from the command line (not from VS) and see what happens. – Daniel Pryden Apr 29 '12 at 20:55
  • @Daniel I ran all my timing tests from the command line outside Visual Studio. I only ran within Visual Studio to get assembly code listings. As a double-check, just now I tried running the timing test with Visual Studio not even running. I still see the different timings. – Edward Brey Apr 29 '12 at 23:57
  • @EdwardBrey: Running the code as x86 makes both run in the same time. Edit: But only on .NET 4. .NET 2 has a difference with both. – leppie Apr 30 '12 at 12:22
  • @leppie Could you try the x86 configuration on the [test project on github](https://github.com/breyed/PerfTest)? (Github has a ZIP button to quickly grab a zip file of the project.) I get very different timings between the one- and two-statement tests, even for x86. – Edward Brey Apr 30 '12 at 13:52
  • @EdwardBrey: That's what I did. – leppie Apr 30 '12 at 13:56
  • @leppie If you turn off "Suppress JIT optimization" in the VS2010 options and put a breakpoint on a statement in the loop, do you see different assembly code for the different forms, specifically the addition of `sete`, `movzx`, and `test` instructions for the two-statement form? – Edward Brey Apr 30 '12 at 14:35
  • When I dump the JIT-compiled code for the two versions (via WinDbg), I do see differences, but I also see significantly more code than you list in your question. Remember there is not a straightforward mapping between the source lines and the JIT-compiled code, so you should list all of it. Comparing just the body of the loop may not provide enough detail for a comparison. – Brian Rasmussen May 02 '12 at 03:38

5 Answers


The question should be "Why do I see such a difference on my machine?". I cannot reproduce such a huge speed difference and suspect there is something specific to your environment. It is very difficult to tell what it might be, though; perhaps some (compiler) options you set some time ago and forgot about.

I have created a console application, rebuilt it in Release mode (x86), and run it outside VS. The results are virtually identical: 1.77 seconds for both methods. Here is the exact code:

static void Main(string[] args)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    int count = 0;

    for (uint i = 0; i < 1000000000; ++i)
    {
        // 1st method
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;

        // 2nd method
        //count += i % 16 == 0 ? 1 : 0;
    }

    sw.Stop();
    Console.WriteLine(string.Format("Elapsed {0}, count {1}", sw.Elapsed, count));
    Console.ReadKey();
}

Please, anyone who has 5 minutes: copy the code, rebuild, run it outside VS, and post your results in the comments to this answer. I'd like to avoid saying "it works on my machine".

EDIT

To be sure, I have created a 64-bit WinForms application, and the results are similar to those in the question: the first method is slower (1.57 sec) than the second one (1.05 sec). The difference I observe is 33%, which is still a lot. It seems there is a bug in the .NET 4 64-bit JIT compiler.

Maciej
  • First method: 1.8736291s, second method: 1.8566318s on my machine, rebuilt with Release (x86), ran outside VS, using the exact same code. – ShdNx Apr 30 '12 at 10:26
  • You need to do something with `count` (such as include it in your `WriteLine` statement). Otherwise the optimizer does some selective optimization, which changes the timing. – Edward Brey Apr 30 '12 at 11:25
  • @EdwardBrey, I have changed my 64 bit tests and got the same results as you are now. Corrected my answer to reflect. – Maciej Apr 30 '12 at 11:48
  • @EdwardBrey, I can only reproduce it in a 64-bit app – Maciej Apr 30 '12 at 12:02
  • I got 1.86s (1st) and 1.38s (2nd) in x64, while in x86 I got around 2.09s for both methods. – Ral Zarek Apr 30 '12 at 12:24
  • I updated the [test project on github](https://github.com/breyed/PerfTest) to include x64 and x86 configurations. After doing a batch rebuild of both configurations and running each .exe outside VS2010 on two computers (both Core i5), I see very different timings for Test1 and Test2 on both x64 and x86. Maciej and Marcus, to help eliminate variables, could you each grab the project and try it? (Github has a ZIP button to quickly grab a zip file of the project.) – Edward Brey Apr 30 '12 at 13:53
  • @EdwardBrey Running your test gives me slower execution on the Multiline version on both platforms. But if I change the test so that it runs 4xMultiline and then 4xSingleline, there is no speed difference on x86 (x64 is unaffected). – Maciej May 01 '12 at 16:17
  • @Maciej +1 for a great observation. It looks like which method gets called first is what matters. This appears to be almost certainly due to JIT order. But why JIT order matters is perplexing. I updated the question body with a link to the [new question](http://stackoverflow.com/questions/10406796/why-does-jit-order-affect-performance) that your observation prompted. – Edward Brey May 02 '12 at 02:13
  • @EdwardBrey Can you upvote my answer instead of the comment as it will show "real" appreciation :-) – Maciej May 02 '12 at 07:23
  • @Maciej Sorry, but I don't think your answer really answers the question. It helped us find a confounding issue, namely that for x86, alignment penalties can cause well-optimized code to run as slowly as less-than-fully optimized code. But the question still remains, why are there cases where less-than-fully optimized x64 (and x86) code is generated in the first place? – Edward Brey May 02 '12 at 11:52

I can't speak to the .NET compiler, or its optimizations, or even WHEN it performs its optimizations.

But in this specific case, if the compiler folded that boolean variable into the actual statement and you were to try to debug this code, the optimized code would not match the code as written. You would not be able to single-step over the isMultipleOf16 assignment and check its value.

That's just one example of where the optimization may well be turned off. There could be others. The optimization may also happen during the load phase of the code, rather than during the CLR's code generation phase.

Modern runtimes are pretty complicated, especially if you throw in JIT compilation and dynamic optimization at run time. Sometimes I feel grateful that the code does what it says at all.
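
As a side note, the runtime does expose an explicit per-method version of this trade-off. A minimal sketch (the method below is only an illustration): marking a method with MethodImplOptions.NoOptimization asks the JIT to leave it unoptimized, which is roughly what the debugger's "Suppress JIT optimization" option does for a whole module.

using System.Runtime.CompilerServices;

static class JitOptOut
{
    // NoOptimization asks the JIT to skip optimizing this method;
    // NoInlining keeps it from being folded into a caller first.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    public static int CountMultiplesOf16(uint limit)
    {
        int count = 0;
        for (uint i = 0; i < limit; ++i)
            count += i % 16 == 0 ? 1 : 0;
        return count;
    }
}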

Will Hartung
  • I wondered to myself when I saw the assembly code if somehow optimization was disabled. I got the assembly code by stopping at a breakpoint in the VS2010 debugger and using the Disassembly window (whereas the timings I got by running without a debugger). As a test, I turned on the Tools > Options > Debugging > General > "Suppress JIT optimization on module" setting. Sure enough, the assembly code got even larger. – Edward Brey Apr 29 '12 at 04:26
  • In the native C++ world, it's perfectly normal for breakpoints and code order to be weird when optimizations are enabled. Likewise, variables like `isMultipleOf16` are not always available in the debugger. That's why there is debug mode. At the end of the day, we're still running the same machine code, so I don't see why the CLR would make things any different. Indeed, when exceptions occur in C#, I sometimes get a message about a variable's value being "optimized away", even in debug mode. – Edward Brey Apr 29 '12 at 04:31
  • +1 for the remark that debug settings can affect code generation. – Marco van de Voort Apr 29 '12 at 08:56

It's a bug in the .NET Framework.

Well, really I'm just speculating, but I submitted a bug report on Microsoft Connect to see what they say. After Microsoft deleted that report, I resubmitted it on the Roslyn project on GitHub.

Update: Microsoft has moved the issue to the coreclr project. From the comments on the issue, calling it a bug seems a bit strong; it's more of a missing optimization.

Edward Brey
  • If I had a dollar for every time a programmer told me, "My code doesn't work. It must be a bug in the framework (or compiler or runtime library, etc.)," and later found out that it was a bug in his own code, I could retire. – Jim Mischel May 02 '12 at 03:30
  • @Jim: I've seen it plenty of times myself. The best antidote I know is to isolate the behavior as fundamentally as possible and provide the vendor with a repro. And to keep a wait-and-see attitude. That's where we're at. – Edward Brey May 07 '12 at 00:45
  • @TankorSmash Perhaps Microsoft deleted it as part of moving the code to GitHub. It's not on my Connect dashboard anymore. It seems that many issues I reported are gone. Some sort of notification would have been nice. I resubmitted the issue to the GitHub project and updated the answer accordingly. – Edward Brey Nov 03 '15 at 22:13

I think this is related to your other question. When I change your code as follows, the multi-line version wins.

Oops: only on x86. On x64, the multi-line version is the slowest, and the conditional beats them both handily.

class Program
{
    static void Main()
    {
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
    }

    public static void ConditionalTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            if (i % 16 == 0) ++count;
        }
        stopwatch.Stop();
        Console.WriteLine("Conditional test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void SingleLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            count += i % 16 == 0 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void MultiLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }
}
Ben Voigt
  • I updated the [repro project](https://github.com/breyed/PerfTest) to include an "if" test. I made single- and multi-line variants. On both x64 and x86, the single-line version is faster (when there is no [alignment penalty](http://stackoverflow.com/a/10407346/145173)). I also made variants that don't have any conditional code in the loop at all (just bit math). They're neck and neck on x86 (I haven't checked the assembly). On x64, the version with a local variable runs faster! It's still amazing that the local variable ever matters at all. – Edward Brey May 03 '12 at 02:23

I tend to think of it like this: the people who work on the compiler can only do so much per year. If in that time they could implement either lambdas or lots of classical optimizations, I'd vote for lambdas. C# is a language that's efficient in terms of code reading and writing effort rather than in terms of execution time.

So it's reasonable for the team to concentrate on features that maximize the reading/writing efficiency, rather than the execution efficiency in a certain corner case (of which there are probably thousands).

Initially, I believe, the idea was that the JITter would do all the optimization. Unfortunately, JITting takes a noticeable amount of time, and any advanced optimizations make it worse. So that didn't work out as well as one might have hoped.

One thing I have found about writing really fast code in C# is that quite often you hit a severe GC bottleneck before any optimization like the one you mention would make a difference, for instance if you allocate millions of objects. C# leaves you very little in terms of avoiding that cost: you can use arrays of structs instead, but the resulting code is really ugly in comparison. My point is that many other decisions about C# and .NET make such specific optimizations less worthwhile than they would be in something like a C++ compiler. Heck, they even dropped the CPU-specific optimizations in NGEN, trading performance for programmer (debugger) efficiency.
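
To make the arrays-of-structs trade-off concrete, here is a minimal sketch (the type and method names are made up for illustration):

// Reference type: a million elements means a million heap objects for the GC to track.
class PointObj { public double X, Y; }

// Value type: elements live inline in the array, so one allocation covers them all.
struct PointVal { public double X, Y; }

static class GcPressureSketch
{
    public static void Allocating()
    {
        var points = new PointObj[1000000];
        for (int i = 0; i < points.Length; i++)
            points[i] = new PointObj { X = i, Y = i };
    }

    public static void StructArray()
    {
        // One allocation total; the GC barely notices, but copy semantics and
        // the lack of nulls tend to make the surrounding code clumsier.
        var points = new PointVal[1000000];
        for (int i = 0; i < points.Length; i++)
        {
            points[i].X = i;
            points[i].Y = i;
        }
    }
}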

Having said all this, I'd love a C# that actually made use of the optimizations that C++ has made use of since the 1990s. Just not at the expense of features like, say, async/await.

Roman Starkov
  • I'd be extremely wary of reading too much into articles from 2005 and .NET 1.1! Quite a lot has changed in the last 7 years. – Jason Williams May 06 '12 at 19:39