
I'm wondering if someone can explain to me what exactly the compiler might be doing for me to observe such extreme differences in performance for a simple method.

    public static uint CalculateCheckSum(string str)
    {
        char[] charArray = str.ToCharArray();
        uint checkSum = 0;
        foreach (char c in charArray)
        {
            checkSum += c;
        }
        return checkSum % 256;
    }

I'm working with a colleague on some benchmarking/optimizations for a message-processing application. Doing 10 million iterations of this function using the same input string took about 25 seconds in Visual Studio 2012; however, when the project was built with the "Optimize Code" option turned on, the same code executed in 7 seconds for the same 10 million iterations.

I'm very interested to understand what the compiler is doing behind the scenes that produces a greater-than-3x performance increase for a seemingly innocent block of code such as this.

As requested, here is a complete Console application that demonstrates what I am seeing.

using System;
using System.Diagnostics;

class Program
{
    public static uint CalculateCheckSum(string str)
    {
        char[] charArray = str.ToCharArray();
        uint checkSum = 0;
        foreach (char c in charArray)
        {
            checkSum += c;
        }
        return checkSum % 256;
    }

    static void Main(string[] args)
    {
        string stringToCount = "8=FIX.4.29=15135=D49=SFS56=TOMW34=11752=20101201-03:03:03.2321=DEMO=DG00121=155=IBM54=138=10040=160=20101201-03:03:03.23244=10.059=0100=ARCA10=246";
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < 10000000; i++)
        {
            CalculateCheckSum(stringToCount);
        }
        stopwatch.Stop();
        Console.WriteLine(stopwatch.Elapsed);
    }
}

Running in Debug with optimization off I see 13 seconds; with it on, I get 2 seconds.

Running in Release with optimization off takes 3.1 seconds, and with it on, 2.3 seconds.
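(As a side note, prompted by the comments: since a string is already directly indexable as characters, the `ToCharArray` call can be dropped entirely. A minimal sketch of that variant — the name `CalculateCheckSumDirect` is mine, not from the original code:)

```csharp
using System;

class ChecksumSketch
{
    // Iterates the string directly instead of allocating a char[] copy.
    public static uint CalculateCheckSumDirect(string str)
    {
        uint checkSum = 0;
        for (int i = 0; i < str.Length; i++)
        {
            checkSum += str[i];
        }
        return checkSum % 256;
    }

    static void Main()
    {
        // 'A' + 'B' + 'C' = 65 + 66 + 67 = 198; 198 % 256 == 198
        Console.WriteLine(CalculateCheckSumDirect("ABC"));
    }
}
```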

Jesse Carter
    A few questions: are you running in `Release` mode? Are you using the `Stopwatch` for timing? – Mike Perrenoud Jan 29 '14 at 17:52
    Hazarding a guess it's probably turning the `foreach` loop into a `for` loop which would prevent it from having to switch from arrays to enumerations. See http://stackoverflow.com/questions/365615/in-net-which-loop-runs-faster-for-or-foreach for some discussion on this. – CassOnMars Jan 29 '14 at 17:52
    a string is already an array of chars. the `ToCharArray` is redundant... thus, the compiler is likely doing something along the lines of `for(int i = 0; i < str.length; i ++) ...` – Alex Jan 29 '14 at 17:53
  • @d_r_w: It ought to do that without release mode too. – SLaks Jan 29 '14 at 17:54
  • @MichaelPerrenoud We are just running out of a unit test but calling the method from a dll that has been built with the Optimize Code option set to off/on. And yes we were using a Stopwatch for timing – Jesse Carter Jan 29 '14 at 17:55
  • @d_r_w: I just tested and it does for both. – SLaks Jan 29 '14 at 17:55
  • Did you have a debugger attached? – SLaks Jan 29 '14 at 17:56
  • @Slaks no, in both cases when testing we simply ran the Unit test using our test runner without debugging turned on – Jesse Carter Jan 29 '14 at 17:58
    Going to sit back and wait for Eric Lippert to provide the definitive, speculation-free response. – Yuck Jan 29 '14 at 17:58
    @JesseCarter: I would *strongly* suggest you run it standalone (e.g. as a console app) rather than from a unit test runner. Remove as many extraneous bits as possible. – Jon Skeet Jan 29 '14 at 17:58
    Can you post your input string as well? Or more ideally, a short but complete program which demonstrates the difference. – Jon Skeet Jan 29 '14 at 18:01
  • I'm not sure what you're expecting from this answer. Are you hoping to get a comprehensive list of all of the optimizations the compiler performs, because that's what it sounds like you're asking, and that's *way* too broad of a question for SO. A book could likely be written on the subject. – Servy Jan 29 '14 at 18:02
  • Just use ILDasm and look at the difference in the compiled code. OffTopic but your checksum function will have a lot of collisions. – the_lotus Jan 29 '14 at 18:04
  • @Servy, no, I wasn't looking for a comprehensive list. It just seemed like a big difference, and I was wondering if there was something inherently inefficient about the way I had written it originally that the compiler would optimize away. I was hoping to learn if it's something I can be more proactive about when actually coding, rather than simply relying on the compiler for these sorts of optimizations. – Jesse Carter Jan 29 '14 at 18:13
  • @JesseCarter But if the compiler is able to perform the optimizations for you then clearly *you don't need to worry about them*. If you care at all about performance you're going to be running the optimized, not unoptimized code. You should be worried about making optimizations that the compiler *can't* make with your time, rather than wasting time optimizing something that can just be optimized for you by simply checking a box. – Servy Jan 29 '14 at 18:14
  • @Servy I understand that. However, I was very curious about the discrepancies in the performance here and simply wanted to know if there was something inherently inefficient about the code that was written. Apologies if that was an inappropriate question I am new to these sorts of things. – Jesse Carter Jan 29 '14 at 18:16
    @JesseCarter As a rule trying to optimize code frequently hurts readability. Doing things like re-using variables, avoiding storing information in a variable, etc., for example, is a simple way to reduce memory, but makes *super* unreadable code. A compiler can trivially perform such an optimization, and when it does it, readability isn't harmed as no human ever needs to read the optimized code. Writing effective, readable, clearly correct code, and letting the compiler perform trivial refactors to improve the code gives you the best of both worlds. – Servy Jan 29 '14 at 18:20
  • @Servy In fact manual optimizations can often interfere with the compiler's attempts to optimize the code and even lead to slower code. – Rick Love Jan 29 '14 at 18:25
  • I don't know; for me the application gives 4 s with non-optimized code and 3 s with optimized, but that happens only if I start the application outside of VS — otherwise they're both 13 seconds. – Fabio Marcolini Jan 29 '14 at 18:35

3 Answers

8

To look at what the C# compiler does for you, you need to look at the IL. If you want to see how that affects the JITted code, you'll need to look at the native code as described by Scott Chamberlain. Be aware that the JITted code will vary based on processor architecture, CLR version, how the process was launched, and possibly other things.

I would usually start with the IL, and then potentially look at the JITted code.

Comparing the IL using ildasm can be slightly tricky, as it includes a label for each instruction. Here are two versions of your method compiled with and without optimization (using the C# 5 compiler), with extraneous labels (and nop instructions) removed to make them as easy to compare as possible:

Optimized

  .method public hidebysig static uint32 
          CalculateCheckSum(string str) cil managed
  {
    // Code size       46 (0x2e)
    .maxstack  2
    .locals init (char[] V_0,
             uint32 V_1,
             char V_2,
             char[] V_3,
             int32 V_4)
    ldarg.0
    callvirt   instance char[] [mscorlib]System.String::ToCharArray()
    stloc.0
    ldc.i4.0
    stloc.1
    ldloc.0
    stloc.3
    ldc.i4.0
    stloc.s    V_4
    br.s       loopcheck
  loopstart:
    ldloc.3
    ldloc.s    V_4
    ldelem.u2
    stloc.2
    ldloc.1
    ldloc.2
    add
    stloc.1
    ldloc.s    V_4
    ldc.i4.1
    add
    stloc.s    V_4
  loopcheck:
    ldloc.s    V_4
    ldloc.3
    ldlen
    conv.i4
    blt.s      loopstart
    ldloc.1
    ldc.i4     0x100
    rem.un
    ret
  } // end of method Program::CalculateCheckSum

Unoptimized

  .method public hidebysig static uint32 
          CalculateCheckSum(string str) cil managed
  {
    // Code size       63 (0x3f)
    .maxstack  2
    .locals init (char[] V_0,
             uint32 V_1,
             char V_2,
             uint32 V_3,
             char[] V_4,
             int32 V_5,
             bool V_6)
    ldarg.0
    callvirt   instance char[] [mscorlib]System.String::ToCharArray()
    stloc.0
    ldc.i4.0
    stloc.1
    ldloc.0
    stloc.s    V_4
    ldc.i4.0
    stloc.s    V_5
    br.s       loopcheck

  loopstart:
    ldloc.s    V_4
    ldloc.s    V_5
    ldelem.u2
    stloc.2
    ldloc.1
    ldloc.2
    add
    stloc.1
    ldloc.s    V_5
    ldc.i4.1
    add
    stloc.s    V_5
  loopcheck:
    ldloc.s    V_5
    ldloc.s    V_4
    ldlen
    conv.i4
    clt
    stloc.s    V_6
    ldloc.s    V_6
    brtrue.s   loopstart

    ldloc.1
    ldc.i4     0x100
    rem.un
    stloc.3
    br.s       methodend

  methodend:
    ldloc.3
    ret
  }

Points to note:

  • The optimized version uses fewer locals. This may allow the JIT to use registers more effectively.
  • The optimized version uses blt.s rather than clt followed by brtrue.s when checking whether or not to go round the loop again (this is the reason for one of the extra locals).
  • The unoptimized version uses an additional local to store the return value before returning, presumably to make debugging easier.
  • The unoptimized version has an unconditional branch just before it returns.
  • The optimized version is shorter, but I doubt that it's short enough to be inlined, so I suspect that's irrelevant.
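For reference, the two listings above can be reproduced from a Visual Studio developer command prompt — the file names here are my own choices, not from the answer:

```shell
# Compile the same source twice, once without and once with /optimize
csc /optimize- /out:ChecksumDebug.exe Program.cs
csc /optimize+ /out:ChecksumRelease.exe Program.cs

# Dump the IL of each build to a text file, then diff the listings
ildasm ChecksumDebug.exe /out=ChecksumDebug.il
ildasm ChecksumRelease.exe /out=ChecksumRelease.il
```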
Jon Skeet
6

To get a good understanding, you should look at the IL code generated.

Compile the assembly, then make a copy of it and compile again with the optimizations. Then open both assemblies in .NET Reflector and compare the differences in the compiled IL.

Update: .NET Reflector is available at http://www.red-gate.com/products/dotnet-development/reflector/

Update 2: ILSpy seems like a good open-source alternative: http://ilspy.net/

Open Source Alternatives to Reflector?

Rick Love
    Looking at the IL is not enough, you need to see the JITed output to see the bulk of the optimizations applied. – Scott Chamberlain Jan 29 '14 at 18:04
    @ScottChamberlain: Not by the *compiler* optimization flag. The output of the compiler is the IL. Any change due to the compiler flag *must* be present in the IL. How the JIT optimizes things is a different matter. – Jon Skeet Jan 29 '14 at 18:08
    Note that you don't need Reflector (a paid product beyond the free trial) - ildasm will do the job just fine. – Jon Skeet Jan 29 '14 at 18:09
    [ILSpy](http://ilspy.net/) is a free alternative to Reflector that also decompiles to C# or IL. – Scott Chamberlain Jan 29 '14 at 18:13
5

I don't know what optimizations it is doing, but I can show you how you can find out for yourself.

First, build your code optimized and start it without the debugger attached (the JIT compiler generates different code when a debugger is attached). Run your code so that the section you care about has been entered at least once, giving the JIT compiler a chance to process it; then in Visual Studio go to Debug->Attach To Process... and choose your running application from the list.

Put a breakpoint in the spot you are wondering about and let the program stop there; once stopped, go to Debug->Windows->Disassembly. That will show you the compiled code the JIT created, and you will be able to inspect what it is doing.

(Screenshot: the Visual Studio Disassembly window showing the JIT-compiled native code.)

Scott Chamberlain
    Looking at the JITted code doesn't show the difference in the *compiler* output. The answer which you commented on but was deleted was right in this case, IMO. The flag is to the compiler, and the output of the compiler is the IL. – Jon Skeet Jan 29 '14 at 18:07
  • @JonSkeet however, and please correct me if I am wrong, the optimized IL will be more likely to be further optimized by the JIT due to it being more "JIT friendly", resulting in much larger changes between the IL code and the JIT code. – Scott Chamberlain Jan 29 '14 at 18:11
  • Yes, it will be further optimized by the JIT. But the OP says they're interested in what the *compiler* is doing - so it makes sense to look at the *compiler* output, not the *JIT* output. (The JIT will depend on all kinds of things.) – Jon Skeet Jan 29 '14 at 18:13
  • @JonSkeet: The OP did ask about the compiler, but it sounds (to me) like he is assuming that is where all of the optimization is occurring. If the compiler is outputting IL which is more easily optimized by the JIT then the compiler differences are only half of the equation. – Ed S. Jan 29 '14 at 18:14
  • In my interpretation of the question the OP is more interested in what in total decreased his run time from 25 seconds to 7 seconds which would involve looking at both the performance gains of the IL optimizations and the optimizations of the JITed code. EDIT: What @EdS. said :) – Scott Chamberlain Jan 29 '14 at 18:15
  • @EdS.: It's possible that *both* would be useful - but I'd at least *start* with the IL, as that's the part which is *directly* affected by the compiler output. – Jon Skeet Jan 29 '14 at 18:20
  • Oh, and if you're debugging - that's going to change the JITted code *massively*. Seeing the fully-optimized JITted code is trickier, IME. – Jon Skeet Jan 29 '14 at 18:21
  • @JonSkeet Doesn't the process I describe show the non debug JIT code (start without debug, allow the JIT to run for the method, attach debugger)? – Scott Chamberlain Jan 29 '14 at 19:29
  • @ScottChamberlain: Ah, missed that bit, sorry. My mistake. But I still think the IL is the first place to look, if you want to see what the *compiler* optimizes. – Jon Skeet Jan 29 '14 at 19:33