C# & .NET: stackalloc

Question

I have a few questions about the functionality of the stackalloc operator.

How does it actually allocate? I thought it does something like:
```
void* stackalloc(int sizeInBytes)
{
    void* p = StackPointer (esp);
    StackPointer += sizeInBytes;
    if(StackPointer exceeds stack size)
        throw new StackOverflowException(...);
    return p;
}
```
But I have done a few tests, and I'm not sure that's how it work. We can't know exactly what it does and how it does it, but I want to know the basics.

I thought that stack allocation (Well, I am actually sure about it) is faster than heap allocation. So why does this example:

 class Program
 {
     static void Main(string[] args)
     {
         Stopwatch sw1 = new Stopwatch();
         sw1.Start();
         StackAllocation();
         Console.WriteLine(sw1.ElapsedTicks);

         Stopwatch sw2 = new Stopwatch();
         sw2.Start();
         HeapAllocation();
         Console.WriteLine(sw2.ElapsedTicks);
     }
     static unsafe void StackAllocation()
     {
         for (int i = 0; i < 100; i++)
         {
             int* p = stackalloc int[100];
         }
     }
     static void HeapAllocation()
     {
         for (int i = 0; i < 100; i++)
         {
             int[] a = new int[100];
         }
     }
 }

gives the average results of 280~ ticks for stack allocation, and usually 1-0 ticks for heap allocation? (On my personal computer, Intel Core i7).

On the computer I am using now (Intel Core 2 Duo), the results make more sense that the previous ones (Probably because Optimize code was not checked in VS): 460~ ticks for stack allocation, and about 380 ticks for heap allocation.

But this still doesn't make sense. Why is it so? I guess that the CLR notices that we don't use the array, so maybe it doesn't even allocate it?

100 iterations isn't really enough to show the trend; I bumped it to 1M, tweaking the code such that it didn't blow the stack. I still get larger stackalloc times, but now with reasonable confidence ;p http://pastie.org/3004442 . Re the last, the compiler can't not allocate without changing the semantics, which is isn't meant to do; I doubt it is removing the alloc. — Marc Gravell, Dec 12 '11 at 10:16
I'm not sure about .NET but in general a stack overflow is detected at the CPU level and results in a CPU trap or similar exception mechanism. Recovering from a stack overflow can be hard and often leads to termination of the thread that owns the stack. — Martin Liversage, Dec 12 '11 at 10:18
Interesting observation; it works much more evenly for small blocks - say, int[15]. And of course avoids the GC impact. — Marc Gravell, Dec 12 '11 at 10:26
I got awful performance on my work machine (Core2Duo). Stack allocation using Mark's tweaked code took ten times as long. Interestingly (to me at least), once I actually used the arrays then the stackalloc method became much faster. See here: http://pastie.org/3004524 — Tom Chantler, Dec 12 '11 at 10:40
You can't expect to benchmark this with any accuracy. Part of the costs of the heap version are made during GC. — H H, Dec 12 '11 at 11:15
With heap allocation, the allocation itself is cheap. It's the collection where most of the cost occurs. — CodesInChaos, Dec 12 '11 at 11:16
@MartinLiversage not quite the CPU level. .NET protects the end of the stack, and throws when it hits that area. If it didn't, then the OS (not CPU) level would deal with the overflow, but in a way that would be harder for .NET to turn into a manageable exception. — Jon Hanna, Dec 12 '11 at 12:15

Jon Hanna · Accepted Answer · 2011-12-12T12:56:55.733

A case where stackalloc is faster:

 private static volatile int _dummy; // just to avoid any optimisations
                                         // that have us measuring the wrong
                                         // thing. Especially since the difference
                                         // is more noticable in a release build
                                         // (also more noticable on a multi-core
                                         // machine than single- or dual-core).
 static void Main(string[] args)
 {
     System.Diagnostics.Stopwatch sw1 = new System.Diagnostics.Stopwatch();
     Thread[] threads = new Thread[20];
     sw1.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoSA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw1.ElapsedTicks);

     System.Diagnostics.Stopwatch sw2 = new System.Diagnostics.Stopwatch();
     threads = new Thread[20];
     sw2.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoHA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw2.ElapsedTicks);
     Console.Read();
 }
 private static void DoSA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        StackAllocation(rnd);
 }
 static unsafe void StackAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int* p = stackalloc int[size];
    _dummy = *(p + rnd.Next(0, size));
 }
 private static void DoHA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        HeapAllocation(rnd);
 }
 static void HeapAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int[] a = new int[size];
    _dummy = a[rnd.Next(0, size)];
 }

Important differences between this code and that in the question:

We have several threads running. With stack allocation, they are allocating in their own stack. With heap allocation, they are allocating from a heap shared with other threads.
Larger sizes allocated.
Different sizes allocated each time (though I seeded the random generator to make the tests more deterministic). This makes heap fragmentation more likely to happen, making heap allocation less efficient than with identical allocations each time.

As well as this, it's also worth noting that stackalloc would often be used as an alternative to using fixed to pin an array on the heap. Pinning arrays is bad for heap performance (not just for that code, but also for other threads using the same heap), so the performance impact would be even greater then, if the claimed memory would be in use for any reasonable length of time.

While my code demonstrates a case where stackalloc gives a performance benefit, that in the question is probably closer to most cases where someone might eagerly "optimise" by using it. Hopefully the two pieces of code together show that whole stackalloc can give a boost, it can also hurt performance a lot too.

Generally, you shouldn't even consider stackalloc unless you are going to need to use pinned memory for interacting with unmanaged code anyway, and it should be considered an alternative to fixed rather than an alternative to general heap allocation. Use in this case still requires caution, forethought before you start, and profiling after you finish.

Use in other cases could give a benefit, but it should be far down the list of performance improvements you would try.

Edit:

To answer part 1 of the question. Stackalloc is conceptually much as you describe. It obtains a chunk of the stack memory, and then returns a pointer to that chunk. It doesn't check the memory will fit as such, but rather if it attempts to obtain memory into the end of the stack - which is protected by .NET on thread creation - then this will cause the OS to return an exceptioin to the runtime, which it then turns into a .NET managed exception. Much the same happens if you just allocate a single byte in a method with infinite recursion - unless the call got optimised to avoid that stack allocation (sometimes possible), then a single byte will eventually add up to enough to trigger the stack overflow exception.

With compacting GC, like the CLR uses, heap fragmentation does never happen. Compacting GC is not used for the Large Object Heap, but it seems you're not allocating objects larger than 85 kB, so LOH shouldn't be used. — svick, Dec 12 '11 at 12:44
Thanks @svick. I has originally intended to have the LOH hit some of the time, and then forgot that while writing. Post editted. That said, I wonder if causing temporary fragmentation that the GC has to compact has any effect, but that's for another experiment. — Jon Hanna, Dec 12 '11 at 12:58

Martin Liversage · Answer 2 · 2011-12-12T14:55:42.573

3

I can't give an exact answer but stackalloc is implemented using the IL opcode localloc. I looked at the machine code generated by a release build for stackallocand it was more convoluted than I expected. I don't know if localloc will check the stack size as you indicate by your if or if the stack overflow is detected by the CPU when the hardware stack actually overflows.

The comments to this answer indicate that the link provided to localloc allocates space from "the local heap". The problem is that there is no good online reference for MSIL except the actual standard available in PDF format. The link above is from the System.Reflection.Emit.OpCodes class which isn't about MSIL but rather a library for generating MSIL.

However, in the standards document ECMA 335 - Common Language Infrastructure there is a more precise description:

Part of each method state is a local memory pool. Memory can be explicitly allocated from the local memory pool using the localloc instruction. All memory in the local memory pool is reclaimed on method exit, and that is the only way local memory pool memory is reclaimed (there is no instruction provided to free local memory that was allocated during this method invocation). The local memory pool is used to allocate objects whose type or size is not known at compile time and which the programmer does not wish to allocate in the managed heap.

So basically the "local memory pool" is what is otherwise known as "the stack" and the C# language uses the stackalloc operator to allocate from this pool.
In a release build the optimizer is smart enough to completely remove the call to HeapAllocation resulting in much lower execution time. It seems that it isn't smart enough to perform the same optimization when using stackalloc. If you either turn off optimization or in some way uses the allocated buffer you will see that stackalloc is slightly faster.

edited Dec 12 '11 at 14:55

answered Dec 12 '11 at 11:18

Martin Liversage

104,481
22
209
256

1

I think it makes sense that methods with `stackalloc` won't be inlined, since clearing (their part of) the stack after they run is quite important here. – svick Dec 12 '11 at 11:28
1

Add `[MethodImpl(MethodImplOptions.NoInlining)]` to the OP's version of `HeapAllocation` and (adding enough calls to make it worth measuring) it's still faster than the stack-allocation based approach. It's just not causing enough pain to the GC for the benefits of stack allocation to be felt. Inlining just removes the overhead of the call itself. – Jon Hanna Dec 12 '11 at 12:39
@Martin Liversage But I just had a look on `localloc` in MSDN, and it is said: *Allocates space from the local heap*, not stack... So how come `stackalloc` uses it? @Jon Hanna, HA does become slower but it is still faster than SA, why is that? SA is just changing the value of `esp` and HA involves calling to Win API function (Maybe `HeapAlloc`?) and doing more complex stuff... – Jong Dec 12 '11 at 14:04
@Jong, where on MSDN did you see that? The link Martin gave says "Allocates a certain number of bytes from the local dynamic memory pool", which in a stack-based implementation (CLI's do not have to use a stack at all) means the stack. Anyway, both have some complexity, but why would HA call a Win API? The .NET heap is hardly going to be call-per-allocation, or every `new` would be too expensive to do in a loop. – Jon Hanna Dec 12 '11 at 14:23
@JonHanna, In the *Remarks* section there is a brief description saying that, which is different from what the description in the top says. Maybe it's their mistake in writing? Giving 2 different meanings for the same opcode... Yes it might be a mistake. And about HA, it sure makes more sense that the CLR doesn't call a HA function for each `new`. I'll search the web for more information about it. – Jong Dec 12 '11 at 14:32
@Jong: The linked MSDN page could use some editing. I checked the CLI standards document and updated my answer to include some of the information. – Martin Liversage Dec 12 '11 at 14:51
@Jong, interesting. It does say "local heap" (do local heaps still even differ from global in Windows?). Experimenting shows it to return a pointer that (which cast to `int`) is as close to one for a local integer as you would expect. I wonder if by "local heap" they mean the thread-specific memory which in a stack-based implementation would be the stack, but without restricting themselves to a stack-based implemenation. – Jon Hanna Dec 12 '11 at 14:59
@JonHanna: If you go to section 3.47 in ECMA 335 linked above you will see that `localloc` allocates memory from the "local dynamic memory pool" to not be specific about the implementation. However, the section also states that it will throw a `StackOverflowException` if there is insufficient memory – Martin Liversage Dec 12 '11 at 19:36
Interesting. In which case "StackOverflowException" would be a misnomer, but consistent with those implementations that are stack-based. Makes sense. – Jon Hanna Dec 12 '11 at 19:39

C# & .NET: stackalloc

2 Answers2

Linked