switch statement performance depends on code size of un-entered inner cases

Question

My C# code generator spits out nested switch statements into some method in a class which I compile dynamically at runtime, load and instantiate, and then execute. Execution time of this is up to 100x faster when compared to the generic, non-compiled version which has to use hash tables (as the hash table keys, which turn into switch cases in the compiled version, are only known at runtime).

As the switch statements get bigger, performance stays pretty much the same, if the number of "switch hops" that are actually executed, do not change, i.e. adding code in case statements that do not get executed does not affect performance.

However, this works up until a certain code size, and then suddenly performance drops by a factor of 7 (when running in 32 bit mode) or 12 (running in native 64 bit mode).

I had a look at the JITted code, and it does in fact change for parts of the code that are not changed, as the code grows. (Not being familiar with assembly and instruction sets,) I assume there is something like a "short jump" and "long jump", the former being limited by the amount of bytes it can jump. Could someone elucidate to the high-level programmer why the generated machine code has to be, or is, different?

N.B. I'm aware that I'm testing code that is doing almost nothing, so smallest differences in the machine code naturally have a huge impact on relative performance. But the point of all this is to generate code that does as close to nothing as possible, as it is called hundreds of thousands of times per second.

Here are two different versions of the switch statement head when overall code size is relatively small and performance good, as copied from Visual Studio using an JIT optimized Release build, running in 32-bit mode:

        switch (a)
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  dec         edx 
00000004  cmp         edx,3Bh 
00000007  jae         0000021D 
0000000d  jmp         dword ptr [edx*4+00773AD8h] 
        {
            case 1: return 1;

And, with slightly more code in the un-entered case blocks - but still as fast:

        switch (a)
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  lea         eax,[edx-1] 
00000006  cmp         eax,3Bh 
00000009  jae         00001C51 
0000000f  jmp         dword ptr [eax*4+00A35830h] 
        {
            case 1:
                {

And this is the version for the much bigger code, which turns out to be 7 times slower.

        switch (a)
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  sub         esp,0FCh 
0000000b  mov         esi,ecx 
0000000d  lea         edi,[ebp+FFFFFEFCh] 
00000013  mov         ecx,3Eh 
00000018  xor         eax,eax 
0000001a  rep stos    dword ptr es:[edi] 
0000001c  mov         ecx,esi 
0000001e  mov         dword ptr [ebp-0Ch],edx 
00000021  mov         eax,dword ptr [ebp-0Ch] 
00000024  mov         dword ptr [ebp-10h],eax 
00000027  mov         eax,dword ptr [ebp-10h] 
0000002a  dec         eax 
0000002b  cmp         eax,3Bh 
0000002e  jae         00000037 
00000030  jmp         dword ptr [eax*4+0077C488h] 
00000037  jmp         0000888F 
        {
            case 1:
                {

N.B. I'm only posting the head of the switch statement, as that is the only thing that gets executed in my test, because I always call the method in question with a value that is in no case statement (and there's no default case), so it will just fall through and (I hope) not execute any code inside the switch.

The second version seems to be allocating and zeroing a big chunk (252 bytes) of stack for local variables, presumably used in one of the switch cases. — Jester, Jul 10 '15 at 10:24

score 2 · Accepted Answer · answered Jul 10 '15 at 16:46

It looks like the main difference between your first two examples and the last is, as Jester pointed out, the last example allocates 252 bytes on the stack and zeros it. This wouldn't be because the code in the switch statement is simply bigger, but because the code in the switch statement uses local variables and temporaries that the previous two examples don't. The first two examples either don't use any local variables or temporaries, or the JIT optimizer managed to allocate them all in registers.

The other notable issue with the last example are the MOV instructions at addresses 0000001e - 00000027. These instructions store the switch value a in two different locations the stack and reload the value from the stack each time. My guess is that that the values stored on the stack are never used again, making this code completely unnecessary. Even if they are used later in the code there's no need reload the value from the stack. Either way the optimizer has failed. If I'm right and these stack locations are unused the optimizer may have failed to eliminate other unnecessary temporaries resulting in even more stack space being used than necessary.

I should point out that difference between the first and second example shows how the optimizer can get a similar case right. The code is different in the first two examples because the optimizer preserves the value of a in the second example, presumably because a is used later in the code. In all examples the assembly code normalizes the range of the switch statement from 1 - 60 to 0 - 59. This saves both an entry in the jump table and a couple of instructions. In the first example the value of a is lost when it does this, in the later two examples the value of a is preserved. The second example just leaves the value of a in the register it was passed into the function with. The third example also preserves it in its original register and then saves two additional copies in separate locations on the stack.

If the overwhelmingly usual case is that no case in the switch statement is executed then a possible solution would be to put a check to see if the if the switch value is in range in its own function. That function would then only call the function containing the switch statement if necessary. Otherwise you can try moving less frequently used and/or high stack usage cases out of the switch statement to their own functions.

(I'm not familiar with Microsoft's JIT optimizer, but you may need to use the NoInlining attribute to prevent it from inlining the separated functions back together.)

Very helpful answer, and an interesting, albeit counter-intuitive, suggestion to separate out code into a separate, noninlineable function. I will check out and if I can convert your hints into actual performance. — Evgeniy Berezovsky, Jul 13 '15 at 00:11

switch statement performance depends on code size of un-entered inner cases

1 Answers1