49

I have these two chunks of code in C#:

First

class Program
{
    static Stack<int> S = new Stack<int>();

    static int Foo(int n) {
        if (n == 0)
            return 0;
        S.Push(0);
        S.Push(1);
        ...
        S.Push(999);
        return Foo( n-1 );
    }
}

Second

class Program
{
    static Stack S = new Stack();

    static int Foo(int n) {
        if (n == 0)
            return 0;
        S.Push(0);
        S.Push(1);
        ...
        S.Push(999);
        return Foo( n-1 );
    }
}

They both do the same:

  1. Create a Stack (generic in <int> for the first example and a stack of object for the second one).

  2. Declare a method that calls itself recursively n times (n >= 0) and in every step push 1000 integers inside the created stack.

When I run the first example with Foo(30000) no exception occurs, however the second example crashes with Foo(1000), just n = 1000.

When I saw the CIL generated for both cases the only difference was the boxing part for every push:

First

IL_0030:  ldsfld     class [System]System.Collections.Generic.Stack`1<int32> Test.Program::S
IL_0035:  ldc.i4     0x3e7
IL_003a:  callvirt   instance void class [System]System.Collections.Generic.Stack`1<int32>::Push(!0)
IL_003f:  nop

Second

IL_003a:  ldsfld     class [mscorlib]System.Collections.Stack Test.Program::S
IL_003f:  ldc.i4     0x3e7
IL_0044:  box        [mscorlib]System.Int32
IL_0049:  callvirt   instance void [mscorlib]System.Collections.Stack::Push(object)
IL_004e:  nop

My question is: Why, if there is not significant overload of CIL's stack for the second example, does it crash "faster" than the first one?

Pang
  • 9,564
  • 146
  • 81
  • 122
lcastillov
  • 2,163
  • 1
  • 11
  • 17
  • 2
    I just ran the code and both crashed at 30k in debug mode and both ran just fine at 30k on release mode. You probably only ran the non-generic version in debug mode. – Servy Mar 04 '15 at 21:25
  • 1
    @JamesThorpe Those calls aren't going to affect the size of the stack frame though, because they won't be on the stack when its recursing. – Servy Mar 04 '15 at 21:25
  • Instead of `F( 1000 )` did you mean `Foo(1000)`? – Steve Wellens Mar 04 '15 at 21:34
  • Yes, I Did. Sorry For That. – lcastillov Mar 04 '15 at 21:36
  • 2
    The stack structures that you create in both examples have nothing to do with the thread stack because they live in the heap. Thus, it doesn't really matter if you call Stack<>.Push method once or million times: it will not affect thread stack usage – Sergey Krusch Mar 04 '15 at 21:39
  • Also, the question itself... Have you double-checked everything before asking? Stack frame size differs for both methods, but not that much (i.e. not 30 times!) – Sergey Krusch Mar 04 '15 at 21:44
  • Sergey Krush, I know that. There is not special reason for using Stack data structure here, just the fact that in my computer the second example above crashed (on debug and release mode). – lcastillov Mar 04 '15 at 21:48
  • 1
    I replicated the OP's issue when compiling to x86. `Stack` crashes at n = 19261; `Stack` crashes at n = 3 (i.e. much later). – Douglas Mar 04 '15 at 21:50
  • 1
    @LeandroCastilloValdés Are you getting a overflow exception, out of memory exception, or something else? – AaronLS Mar 04 '15 at 21:55
  • @Douglas, really? How? I have tried this for x86 as well: stack frame size for the version with templates is 8 bytes, stack frame size for version with objects is 20 bytes. – Sergey Krusch Mar 04 '15 at 21:55
  • @SergeyKrusch: I'm targeting .NET Framework 4.5.1 under Debug mode. Release mode works (doesn't crash) for both code versions. – Douglas Mar 04 '15 at 21:58
  • @Servy Did you rewrite using a loop? On my system, non-generic crashes very, very low in both release and debug (if not looping, as described in post) – Reed Copsey Mar 04 '15 at 22:24

1 Answers1

59

Why, if there is not significant overload of CIL's stack for second example, does it crash "faster" than the first one?

Note that the number of CIL instructions does not accurately represent the amount of work or memory that will be used. A single instruction can be very low impact, or very high impact, so counting CIL instructions is not an accurate way to measure "work".

Also realize that the CIL is not what gets executed. The JIT compiles the CIL to actual machine instructions, with an optimization phase, so the CIL may be very different than the actual executed instructions.

In the second case, since you're using a non-generic collection, every Push call requires the integer to be boxed, as you determined in CIL.

Boxing an integer effectively creates an object which "wraps" the Int32 for you. Instead of just loading a 32-bit integer onto the stack, it now has to load a 32-bit integer onto the stack, then box it, which effectively also loads an object reference onto the stack.

If you inspect this in the Disassembly window, you can see the difference between the generic and non-generic version is dramatic, and much more significant than the generated CIL would suggest.

The generic version effectively compiles to as a series of calls like so:

0000022c  nop 
            S.Push(25);
0000022d  mov         ecx,dword ptr ds:[03834978h] 
00000233  mov         edx,19h 
00000238  cmp         dword ptr [ecx],ecx 
0000023a  call        71618DD0 
0000023f  nop 
            S.Push(26);
00000240  mov         ecx,dword ptr ds:[03834978h] 
00000246  mov         edx,1Ah 
0000024b  cmp         dword ptr [ecx],ecx 
0000024d  call        71618DD0 
00000252  nop 
            S.Push(27);

The non-generic, on the other hand, has to create the boxed objects, and instead compiles to:

00000645  nop 
            S.Push(25);
00000646  mov         ecx,7326560Ch 
0000064b  call        FAAC20B0 
00000650  mov         dword ptr [ebp-48h],eax 
00000653  mov         eax,dword ptr ds:[03AF4978h] 
00000658  mov         dword ptr [ebp+FFFFFEE8h],eax 
0000065e  mov         eax,dword ptr [ebp-48h] 
00000661  mov         dword ptr [eax+4],19h 
00000668  mov         eax,dword ptr [ebp-48h] 
0000066b  mov         dword ptr [ebp+FFFFFEE4h],eax 
00000671  mov         ecx,dword ptr [ebp+FFFFFEE8h] 
00000677  mov         edx,dword ptr [ebp+FFFFFEE4h] 
0000067d  mov         eax,dword ptr [ecx] 
0000067f  mov         eax,dword ptr [eax+2Ch] 
00000682  call        dword ptr [eax+18h] 
00000685  nop 
            S.Push(26);
00000686  mov         ecx,7326560Ch 
0000068b  call        FAAC20B0 
00000690  mov         dword ptr [ebp-48h],eax 
00000693  mov         eax,dword ptr ds:[03AF4978h] 
00000698  mov         dword ptr [ebp+FFFFFEE0h],eax 
0000069e  mov         eax,dword ptr [ebp-48h] 
000006a1  mov         dword ptr [eax+4],1Ah 
000006a8  mov         eax,dword ptr [ebp-48h] 
000006ab  mov         dword ptr [ebp+FFFFFEDCh],eax 
000006b1  mov         ecx,dword ptr [ebp+FFFFFEE0h] 
000006b7  mov         edx,dword ptr [ebp+FFFFFEDCh] 
000006bd  mov         eax,dword ptr [ecx] 
000006bf  mov         eax,dword ptr [eax+2Ch] 
000006c2  call        dword ptr [eax+18h] 
000006c5  nop 

Here you can see the significance of boxing.

In your case, boxing the integer causes the boxed object references to get loaded onto the stack. On my system, this is causing a stackoverflow on any calls larger than Foo(127) (in 32 bit), which suggests that the integers and boxed object references (4 bytes each) are all being kept on the stack, as 127*1000*8==1016000, which is dangerously close to the default 1 MB thread stack size for .NET applications.

When using the generic version, since there is no boxed object, the integers aren't having to all be stored on the stack, and the same register is being reused. This allows you to recurse significantly more (>40000 on my system) before you use up the stack.

Note that this will be CLR version and platform dependent, as there is a different JIT on x86/x64, as well.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
  • 2
    When I tested the issue, I used a for loop – `for (int i = 0; i <= 999; i++) S.Push(i);` – and got behaviour similar to the OP's. This would only fit your explanation if the compiler was performing loop unrolling. Do you assume this is the case? – Douglas Mar 04 '15 at 22:13
  • @Douglas With a for loop, you get significantly different IL and code - I actually wrote it out long hand, as described in the post, to get the above. With a for loop, the post-JIT code doesn't exhibit the same problems (on my system) – Reed Copsey Mar 04 '15 at 22:14
  • @ReedCopsey: Fair enough. Thanks for the detailed explanation above! – Douglas Mar 04 '15 at 22:15
  • 2
    Is there any good reason for the JIT to leave the references on the stack? – user253751 Mar 05 '15 at 08:46
  • 1
    @immibis In this case, no, but it's not optimizing it away for whatever reason. There are many more potential optimization opportunities in the JIT that are still not being exploited... This seems like a good one. – Reed Copsey Mar 05 '15 at 21:50
  • @Douglas , `for (int i = 0; i <= 999; i++) S.Push(i);` Could you please add a repro code working when run without debugger? I've checked quickly debug/release and 3.5..4.5 in different combinations, and there was no StackOverflowException. Looks like "blame the debugger" thing. Thanks! – Sinix Mar 11 '15 at 10:41
  • 1
    @Sinix As I said, it doesn't happen with for loops for me, only when inlining the calls... – Reed Copsey Mar 11 '15 at 21:22
  • @Reed Copsey Whoops, missed that. Thanks again:) – Sinix Mar 12 '15 at 05:47
  • 1
    @Sinix: http://ideone.com/5tec26. `StackOverflowException` still occurs for "Start Without Debugging" (albeit later). I'm compiling as x86, targeting .NET Framework 4.5.1 under Debug mode. Doesn't occur on ideone since it runs Mono. – Douglas Mar 15 '15 at 08:32
  • @Douglas, thanks! for loop does nothing special here. You can replace it with `S.Push(n);`, same result. – Sinix Mar 16 '15 at 09:32