229

I'm currently working on a very performance-critical program, and one path I decided to explore that may help reduce resource consumption was increasing my worker threads' stack size so I can move most of the data (float[]s) that I'll be accessing onto the stack (using stackalloc).

I've read that the default stack size for a thread is 1 MB, so in order to move all my float[]s I would have to expand the stack by approximately 50 times (to ~50 MB).

I understand this is generally considered "unsafe" and isn't recommended, but after benchmarking my current code against this method, I've discovered a 530% increase in processing speed! So I cannot simply pass over this option without further investigation, which leads me to my question: what are the dangers associated with increasing the stack to such a large size (what could go wrong), and what precautions should I take to minimise such dangers?
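For context, a worker thread with a custom stack size can be requested through the Thread(ParameterizedThreadStart, Int32) constructor overload; the following is a minimal sketch of that kind of setup (sizes are illustrative, and it must be compiled with /unsafe):

public static void StartWorker()
{
    // Ask for a ~200 MB stack; note the parameter is ignored unless
    // the code runs with full trust (see the answers below).
    var worker = new System.Threading.Thread(Work, 200 * 1024 * 1024);
    worker.Start(12500000);
    worker.Join();
}

private static unsafe void Work(object state)
{
    int size = (int)state;
    // 12,500,000 floats * 4 bytes = 50 MB carved out of the enlarged stack.
    float* samples = stackalloc float[size];
    for (var ii = 0; ii < size; ii++)
    {
        samples[ii] = 32768;
    }
}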

My test code:

public static unsafe void TestMethod1()
{
    float* samples = stackalloc float[12500000]; // 12,500,000 floats * 4 bytes = 50 MB

    for (var ii = 0; ii < 12500000; ii++)
    {
        samples[ii] = 32768;
    }
}

public static void TestMethod2()
{
    var samples = new float[12500000];

    for (var i = 0; i < 12500000; i++)
    {
        samples[i] = 32768;
    }
}
Sam
  • 98
    +1. Seriously. You ask what LOOKS like an idiotic question out of the norm, and then you make a VERY good case that in your particular scenario it is a sensible thing to consider, because you did your homework and measured the outcome. This is VERY good - I miss that with many questions. Very nice - good you consider something like this; sadly many many C# programmers are not aware of those optimization opportunities. Yes, often not needed - but sometimes it is critical and makes a huge difference. – TomTom Jun 13 '14 at 08:44
  • How were you previously allocating your `float` arrays? You may be able to get a comparative performance gain another way. – Binary Worrier Jun 13 '14 at 08:45
  • @BinaryWorrier The normal way `var myArray = new float[x];`. – Sam Jun 13 '14 at 08:46
  • 5
    I'm interested to see the two pieces of code that have a 530% difference in processing speed, solely on account of moving the array to the stack. That just does not feel right. – Dialecticus Jun 13 '14 at 08:49
  • Do you mean `var myArray = new float[x]`, and were you doing this once per thread? – Binary Worrier Jun 13 '14 at 08:50
  • @BinaryWorrier Yes, sorry typed that up "free-hand". And yes, once per thread. – Sam Jun 13 '14 at 08:51
  • 50 MB is approximately 13 million floats; you were allocating a single float array per thread for 13 million floats? – Binary Worrier Jun 13 '14 at 08:53
  • 13
    Before you leap down that road: have you tried using `Marshal.AllocHGlobal` (don't forget to `FreeHGlobal` too) to allocate the data *outside* of managed memory? Then cast the pointer to a `float*`, and you should be sorted. – Marc Gravell Jun 13 '14 at 08:53
  • @MarcGravell No I haven't tried that, yet. I'll see how that "fairs out", thanks. – Sam Jun 13 '14 at 08:55
  • 2
    It does feel right if you do a lot of allocations. Stackalloc bypasses all the GC issues, which also can create / does create a very strong locality on the processor level. This is one of the things that look like micro-optimizations - unless you write a high performance mathematical program and are having exactly this behavior and it makes a difference ;) – TomTom Jun 13 '14 at 08:59
  • @TomTom I suspect "locality" here could be misleading; a copy of the *pointer* could benefit from locality, but the actual data is going to have exactly the same hit/page-fault etc figures as it would anywhere else; there's only so much you can do with a huge ton of data – Marc Gravell Jun 13 '14 at 09:01
  • Locality is more in the direction of the first level cache that already has the stack in the processor cache architecture. – TomTom Jun 13 '14 at 09:06
  • 1
    @TomTom yeah, you're not going to get much of a 50MB chunk into processor cache, and when it does: it'll be pretty much the same as if that data had come from anywhere else. – Marc Gravell Jun 13 '14 at 09:48
  • First - not true ;) Some Xeons are huge in caches, but mostly: guess what, the GC moving because of that pressure does not really help a lot either ;) – TomTom Jun 13 '14 at 10:23
  • You keep saying 50 MB but 50,000,000 `float`s means 4 times the number of bytes - 200 MB. – Vercas Jun 13 '14 at 13:11
  • @Vercas Sorry, I'm mixing up the byte count vs float count. I mean 12,500,000 floats. – Sam Jun 13 '14 at 13:12
  • @Sam Ah, okay. Did you try what Marc Gravell suggested on his first comment? I am really curious about the outcome. – Vercas Jun 13 '14 at 13:13
  • 1
    @Vercas Yes I've just finished the benchmarks, Marc's method is 115% faster than method 2. And PMF's method is 427% faster than Marc's. – Sam Jun 13 '14 at 13:20
  • @Sam That is really odd... I'll try benchmarking too. – Vercas Jun 13 '14 at 13:26
  • 6
    My suspicion: one of these methods triggers bounds-checking on every loop iteration while the other one does not, or it is optimized away. – pjc50 Jun 13 '14 at 13:47
  • @Vercas Here's the code I used for [Marc's method](https://gist.github.com/ArcticWinter/9525b8b98be7e241bd4d#file-mark-s-method) & [PMF's method](https://gist.github.com/ArcticWinter/f025f2efd67dcb226ccd#file-pfm-s-method). – Sam Jun 13 '14 at 14:02
  • Just use malloc (or something equivalent). The overhead of allocating must be tiny compared to processing 50 MB of data. – usr Jun 13 '14 at 14:58
  • 1
    Considering that both test methods do nothing and should be removed entirely after being JIT compiled, I would seriously doubt those benchmark results. You're probably just measuring the interpreted time, where the unsafe code is not generating any bounds checks while the safe code does. – Voo Jun 14 '14 at 20:07

8 Answers

46

Upon comparing test code with Sam, I determined that we are both right!
However, about different things:

  • Accessing memory (reading and writing) is just as fast wherever it is - stack, global or heap.
  • Allocating it, however, is fastest on stack and slowest on heap.

It goes like this: stack < global < heap. (allocation time)
Technically, stack allocation isn't really an allocation; the runtime just makes sure part of the stack (frame?) is reserved for the array.
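To make that concrete, here is a minimal sketch of the two cases being compared (names and sizes are illustrative; the stack version requires compiling with /unsafe):

// Pays the allocation cost on every call; the "allocation" is just a
// stack-pointer bump, and the memory is gone when the method returns.
static unsafe float StackVersion(int size)
{
    float* samples = stackalloc float[size];
    for (int i = 0; i < size; i++) samples[i] = 32768;
    return samples[size - 1];
}

// Also pays the allocation cost on every call, but through the GC heap,
// the slowest of the three allocators.
static float HeapVersion(int size)
{
    var samples = new float[size];
    for (int i = 0; i < size; i++) samples[i] = 32768;
    return samples[size - 1];
}

Timing many calls of each mostly measures the allocators; timing one long loop over a single pre-allocated buffer measures only the access speed, which is the same in both cases.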

I strongly advise being careful with this, though.
I recommend the following:

  1. When you need to create arrays frequently which never leave the function (e.g. by a reference to them being passed out), using the stack will be an enormous improvement.
  2. If you can recycle an array, do so whenever you can! The heap is the best place for long-term object storage. (Polluting global memory isn't nice; stack frames can disappear.) A sketch of this follows after the note below.

(Note: 1. only applies to value types; reference types will be allocated on the heap and the benefit will be reduced to zero.)
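A minimal sketch of point 2, with hypothetical names - the buffer is allocated once on the heap and recycled across calls:

class SampleProcessor
{
    private readonly float[] _buffer; // long-term storage belongs on the heap

    public SampleProcessor(int size)
    {
        _buffer = new float[size]; // allocation cost is paid exactly once
    }

    public void Process()
    {
        // Each call pays only for the memory accesses, which are just as
        // fast here as they would be for stack or global memory.
        for (int i = 0; i < _buffer.Length; i++)
            _buffer[i] = 32768;
    }
}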

To answer the question itself: I have not encountered any problem at all with any large-stack test.
I believe the only possible problems are a stack overflow, if you are not careful with your function calls, and running out of memory when creating your thread(s) if the system is running low.

The section below is my initial answer. It is wrong-ish and the tests aren't correct. It is kept only for reference.


My test indicates that stack-allocated memory and global memory are at least 15% slower than (take 120% of the time of) heap-allocated memory for usage in arrays!

This is my test code, and this is a sample output:

Stack-allocated array time: 00:00:00.2224429
Globally-allocated array time: 00:00:00.2206767
Heap-allocated array time: 00:00:00.1842670
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 100.80 %| 120.72 %|
--+---------+---------+---------+
G |  99.21 %|    -    | 119.76 %|
--+---------+---------+---------+
H |  82.84 %|  83.50 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value by the column's.

I tested on Windows 8.1 Pro (with Update 1), using an i7-4700MQ, under .NET 4.5.1.
I tested both with x86 and x64 and the results are identical.

Edit: I increased the stack size of all threads to 201 MB, the sample size to 50 million, and decreased the iterations to 5.
The results are the same as above:

Stack-allocated array time: 00:00:00.4504903
Globally-allocated array time: 00:00:00.4020328
Heap-allocated array time: 00:00:00.3439016
------------------------------------------
Fastest: Heap.

  |    S    |    G    |    H    |
--+---------+---------+---------+
S |    -    | 112.05 %| 130.99 %|
--+---------+---------+---------+
G |  89.24 %|    -    | 116.90 %|
--+---------+---------+---------+
H |  76.34 %|  85.54 %|    -    |
--+---------+---------+---------+
Rates are calculated by dividing the row's value by the column's.

It seems, though, that the stack is actually getting slower.

Vercas
  • I'd have to disagree; [my benchmark's results](https://gist.github.com/ArcticWinter/bc701d47ddfc03cfb401#file-benchmark) (see comment at bottom of page for results) show that the stack is marginally faster than global, and much quicker than the heap; and to be definitely sure that my results are accurate I ran the test 20 times, and each method was called 100 times per test iteration. Are you definitely running your benchmark correctly? – Sam Jun 13 '14 at 19:24
  • I am getting very inconsistent results. With full trust, x64, release config, no debugger, they are all equally fast (less than 1% difference; fluctuating) while yours is indeed much faster with a stack. I need to test further! **Edit**: Yours SHOULD throw a stack overflow exception. You merely allocate enough for the array. O_o – Vercas Jun 13 '14 at 19:40
  • Yeah I know, it's close. You need to repeat the benchmarks a few times, like I did, maybe try taking an average over 5 or so runs. – Sam Jun 13 '14 at 19:44
  • I added my code in a thread too and synced the number of iterations and tests. My code still yields almost identical results! (speeds are the same in every case) And of course, I run the benchmarks multiple times. – Vercas Jun 13 '14 at 19:45
  • I did, I told you in the first comment under this answer. And you are right, with your code, the stack is faster. The test case only takes 60%-65% of the time of other test cases. **Edit**: I swear something is wrong with your code. You allocate 50 MB to the stack and use all 50 MB with one array! Where the heck does the rest of the stack data go?! – Vercas Jun 13 '14 at 19:47
  • Sorry, I must have missed that (speed reading). Hmm, well it's certainly turning out to be an interesting issue; let me know your results when you get them ;) – Sam Jun 13 '14 at 19:49
  • @Sam I found the difference. You allocate memory on every iteration, I allocate the same memory on every test for all iterations. So, accessing the memory is just as fast wherever it is. Allocating it, though, is faster on the stack. There, mystery solved. – Vercas Jun 13 '14 at 19:56
  • @Sam I added all the information to my answer now. 'Tis now the only true answer to the question. – Vercas Jun 13 '14 at 20:05
  • The problem with the benchmark here is that it doesn't really take care of the JIT. You need some startup time to make sure all methods are compiled before measuring time, otherwise the results can easily vary by a good deal depending on runtime and PC. Also it should be noted that on my PC I get a stackoverflow exception when trying with `12500000` floats - stacks aren't moved afaik, it's just a question of whether the initial commit size is large enough or not. – Voo Jun 14 '14 at 21:19
  • @Voo Are you referring to Vercas' benchmark or mine? – Sam Jun 14 '14 at 21:48
  • @Sam Neither code seems to account for a separate warmup time before benchmarking as far as I can see. – Voo Jun 14 '14 at 22:17
  • @Voo Even averaging the test over 20 runs won't allow for an adequate warm up? So what would you recommend? – Sam Jun 14 '14 at 22:20
  • @Sam Basically you want to first call the function a few times *before* you ever start measuring time, but there are other things to take care of too. [See here](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) for a discussion on Java, which is for most of it just as applicable to C# (sadly I don't know how to convert the advice to C# itself, other flags and all). Jon's framework is a good start though and even results in way less code for you to write yourself (I'd change the defaults a bit though). – Voo Jun 14 '14 at 22:25
  • 1
    @Voo The 1st run took as much time as the 100th run of any test for me. From my experience, this Java JIT thing does not apply to .NET at all. The only "warm up" that .NET does is loading classes and assemblies when used for the first time. – Vercas Jun 14 '14 at 22:26
  • @Vercas Well there is the GAC but clearly that's not applicable here. And since the CLR also uses a JIT (no idea where you get the idea it doesn't, or that it compiles everything immediately?) just as java this is just as applicable. The code by Sam actually gives the perfect example: The 500% performance difference is exactly due to not taking care of warmup. How much effect that has (which also varies from PC to PC) is certainly hard to say and if you run long enough you amortize it - but why write a flawed benchmark and then reason about it instead of avoiding the problems to begin with? – Voo Jun 14 '14 at 22:31
  • 2
    @Voo Test my benchmark and the one from the gist he added in a comment to this answer. Assemble the codes together and run a few hundred tests. Then come back and report your conclusion. I've done my tests very thoroughly, and I know very well what I am talking about when saying that .NET does not interpret any bytecode like Java does, it JITs it instantly. – Vercas Jun 14 '14 at 22:34
  • @Vercas You are absolutely right, mea culpa. Exceedingly surprising considering the possible advantages without much downsides but a quick and simple test definitely shows that you're right. – Voo Jun 14 '14 at 22:46
  • @Vercas Changed my answer to avoid confusing more people, thanks for this newly won knowledge. It is dangerous to assume that the CLR will work just as HotSpot just because they are generally so similar on a high enough level. And hey, the nice thing about CS is that you can often easily proof whether someone's right or wrong, it'd be way more embarrassing to argue a point that I knew is wrong, wouldn't it? :-) – Voo Jun 14 '14 at 22:51
  • @Voo Indeed. Even though their purposes are similar, they approach the problem differently. – Vercas Jun 14 '14 at 22:57
  • I don't get it. So this answer confirms that the 530% improvement exists and then goes on to say that the only danger is about the possibility of throwing StackOverflowError? – justhalf Jun 18 '14 at 01:44
  • @justhalf It confirms that a performance improvement may occur because of allocation and presents possible risks of doing so, amongst which the most obvious is stack memory exhaustion. – Vercas Jun 19 '14 at 06:40
28

I've discovered a 530% increase in processing speed!

That's by far the biggest danger I'd say. There's something seriously wrong with your benchmark; code that behaves this unpredictably usually has a nasty bug hidden somewhere.

It is very, very difficult to consume a lot of stack space in a .NET program, other than by excessive recursion. The size of the stack frame of a managed method is set in stone: it is simply the sum of the method's arguments and its local variables. Minus the ones that can be stored in a CPU register, but you can ignore those since there are so few of them.

Increasing the stack size doesn't accomplish anything; you'll just reserve a bunch of address space that will never be used. There is no mechanism that can explain a perf increase from not using memory, of course.

This is unlike a native program, particularly one written in C, which can also reserve space for arrays on the stack frame - the basic malware attack vector behind stack buffer overflows. This is possible in C# as well; you'd have to use the stackalloc keyword. If you are doing that, then the obvious danger is having to write unsafe code that is subject to such attacks, as well as random stack frame corruption - very hard to diagnose bugs. There is a counter-measure against this in later jitters, I think starting with .NET 4.0, where the jitter generates code to put a "cookie" on the stack frame and checks whether it is still intact when the method returns. Instant crash to the desktop, without any way to intercept or report the mishap, if that happens. That's ... dangerous to the user's mental state.

The main thread of your program, the one started by the operating system, will have a 1 MB stack by default, 4 MB when you compile your program targeting x64. Increasing that requires running Editbin.exe with the /STACK option in a post-build event. You can typically ask for up to 500 MB before your program will have trouble getting started when running in 32-bit mode. Threads can too, much more easily of course; the danger zone typically hovers around 90 MB for a 32-bit program. That is triggered when your program has been running for a long time and the address space has become fragmented from previous allocations. Total address space usage must already be high, over a gig, to get this failure mode.

Triple-check your code; there's something very wrong. You can't get a 5x speedup with a bigger stack unless you explicitly write your code to take advantage of it, which always requires unsafe code. Using pointers in C# always has a knack for creating faster code; they aren't subjected to the array bounds checks.

Kerem Baydoğan
Hans Passant
  • 21
    The 5x speedup reported was from moving from `float[]` to `float*`. The large stack was simply how that was accomplished. A 5x speedup in some scenarios is entirely reasonable for that change. – Marc Gravell Jun 13 '14 at 09:47
  • 3
    Okay, I didn't have the code snippet yet when I started answering the question. Still close enough. – Hans Passant Jun 13 '14 at 09:55
22

I would have a reservation there simply because I wouldn't know how to predict it - permissions, the GC (which needs to scan the stack), etc. - all could be impacted. I would be very tempted to use unmanaged memory instead:

// sizeBytes and DoWork are placeholders from the surrounding discussion;
// Marshal lives in System.Runtime.InteropServices, and the float* cast
// requires an unsafe context.
var ptr = Marshal.AllocHGlobal(sizeBytes);
try
{
    float* x = (float*)ptr;
    DoWork(x);
}
finally
{
    // unmanaged memory is invisible to the GC; it must be freed explicitly
    Marshal.FreeHGlobal(ptr);
}
Marc Gravell
  • 1
    Side question: Why would the GC need to scan the stack? Memory allocated by `stackalloc` is not subject to garbage collection. – dcastro Jun 13 '14 at 09:23
  • 6
    @dcastro it needs to scan the stack to check for references that only exist on the stack. I simply don't know what it is going to do when it gets to such a huge `stackalloc` - it kinda needs to jump it, and you'd hope it would do so effortlessly - but the point I am trying to make is that it introduces *unnecessary* complications / concerns. IMO, `stackalloc` is great as a scratch-buffer, but for a dedicated workspace, it is more expected to just allocate a chunk-o-memory somewhere, rather than abusing / confusing the stack. – Marc Gravell Jun 13 '14 at 09:44
8

One thing that can go wrong is that you might not get permission to do so. Unless running in full-trust mode, the Framework will just ignore the request for a larger stack size (see MSDN on the Thread constructor (ParameterizedThreadStart, Int32)).

Instead of increasing the system stack size to such huge numbers, I would suggest rewriting your code so that it uses iteration and a manual stack implementation on the heap, as sketched below.
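For illustration, a minimal sketch of that idea with a hypothetical Node type: a traversal that would otherwise recurse (consuming call-stack space) instead iterates over an explicit Stack<T> that lives on the heap and grows as needed:

using System.Collections.Generic;

class Node
{
    public List<Node> Children = new List<Node>();
}

static class Traversal
{
    static void Visit(Node root)
    {
        var pending = new Stack<Node>(); // heap-based, no call-stack depth limit
        pending.Push(root);
        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            // ... process current here ...
            foreach (Node child in current.Children)
                pending.Push(child);
        }
    }
}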

AnthonyLambert
PMF
  • 1
    Good idea, I'll iterate through instead. Besides that, my code is running in full trust mode, so are there any other things I should look out for? – Sam Jun 13 '14 at 08:47
6

High-performance arrays might be accessible in the same way as normal C# ones, but that could be the beginning of trouble: consider the following code:

float[] someArray = new float[100];
someArray[200] = 10.0f;

You expect an IndexOutOfRangeException, and this makes complete sense: you are trying to access element 200, but the maximum allowed index is 99. If you go down the stackalloc route, there will be no object wrapped around your array to bounds-check it, and the following will not throw any exception:

float* pFloat = stackalloc float[100];
pFloat[200] = 10.0f;

Above, you are allocating enough memory to hold 100 floats, and you are setting the sizeof(float) memory location that starts at (the start of this memory) + 200 * sizeof(float) to hold your float value 10. Unsurprisingly, this memory is outside the memory allocated for the floats, and nobody knows what could be stored at that address. If you are lucky, you might have used some currently unused memory, but it is just as likely that you overwrite some location that was used for storing other variables. To summarize: unpredictable runtime behaviour.
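If you do take the stackalloc route anyway, nothing stops you from reinstating the check by hand; a minimal sketch with a hypothetical helper:

public static unsafe void SafeWrite(float* buffer, int length, int index, float value)
{
    // Restores the range guarantee the runtime provides for managed arrays;
    // the (uint) cast also rejects negative indices in a single comparison.
    if ((uint)index >= (uint)length)
        throw new IndexOutOfRangeException();
    buffer[index] = value;
}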

MHOOS
  • Factually wrong. The runtime and compiler tests are still there. – TomTom Jun 13 '14 at 09:00
  • 9
    @TomTom erm, no; the answer has merit; the question talks about `stackalloc`, in which case we're talking about `float*` etc - which does not have the same checks. It is called `unsafe` for a very good reason. Personally I'm perfectly happy to use `unsafe` when there is a good reason, but Socrates makes some reasonable points. – Marc Gravell Jun 13 '14 at 09:03
  • @Marc For the shown code (after the JIT is run) there are no more bounds checks because it's trivial for the compiler to reason that all accesses are in-bounds. In general though this can certainly make a difference. – Voo Jun 14 '14 at 21:17
6

Microbenchmarking languages with a JIT and GC such as Java or C# can be a bit complicated, so it's generally a good idea to use an existing framework - Java offers JMH or Caliper, which are excellent; sadly, to the best of my knowledge, C# doesn't offer anything approaching those. Jon Skeet wrote this here, which I'll blindly assume takes care of the most important things (Jon knows what he's doing in that area; also, yes, no worries, I did actually check). I tweaked the timing a bit, because 30 seconds per test after warmup was too much for my patience (5 seconds ought to do).

So first the results: .NET 4.5.1 under Windows 7 x64 - the numbers denote the iterations it could run in 5 seconds, so higher is better.

x64 JIT:

Standard       10,589.00  (1.00)
UnsafeStandard 10,612.00  (1.00)
Stackalloc     12,088.00  (1.14)
FixedStandard  10,715.00  (1.01)
GlobalAlloc    12,547.00  (1.18)

x86 JIT (yeah that's still kind of sad):

Standard       14,787.00   (1.02)
UnsafeStandard 14,549.00   (1.00)
Stackalloc     15,830.00   (1.09)
FixedStandard  14,824.00   (1.02)
GlobalAlloc    18,744.00   (1.29)

This gives a much more reasonable speedup of at most 14% (and most of the overhead is due to the GC having to run; consider it a worst-case scenario realistically). The x86 results are interesting though - it's not entirely clear what's going on there.

and here's the code:

public static float Standard(int size) {
    float[] samples = new float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float UnsafeStandard(int size) {
    float[] samples = new float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float Stackalloc(int size) {
    float* samples = stackalloc float[size];
    for (var ii = 0; ii < size; ii++) {
        samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
    }
    return samples[size - 1];
}

public static unsafe float FixedStandard(int size) {
    float[] prev = new float[size];
    fixed (float* samples = &prev[0]) {
        for (var ii = 0; ii < size; ii++) {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }
        return samples[size - 1];
    }
}

public static unsafe float GlobalAlloc(int size) {
    var ptr = Marshal.AllocHGlobal(size * sizeof(float));
    try {
        float* samples = (float*)ptr;
        for (var ii = 0; ii < size; ii++) {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }
        return samples[size - 1];
    } finally {
        Marshal.FreeHGlobal(ptr);
    }
}

static void Main(string[] args) {
    int inputSize = 100000;
    var results = TestSuite.Create("Tests", inputSize, Standard(inputSize)).
        Add(Standard).
        Add(UnsafeStandard).
        Add(Stackalloc).
        Add(FixedStandard).
        Add(GlobalAlloc).
        RunTests();
    results.Display(ResultColumns.NameAndIterations);
}
Voo
  • An interesting observation, I'll have to check my benchmarks again. Though this still doesn't really answer my question, "*...what are the dangers associated with increasing the stack to such a large size...*". Even if my results are incorrect, the question is still valid; I appreciate the effort nevertheless. – Sam Jun 14 '14 at 21:39
  • 1
    @Sam When using `12500000` as size I actually get a stackoverflow exception. But mostly this was about rejecting the underlying premise that using stack-allocated code is several orders of magnitude quicker. We're doing pretty much the least amount of work possible here otherwise, and the difference is already only about 10-15% - in practice it will be even lower. This, in my opinion, definitely changes the whole discussion. – Voo Jun 14 '14 at 22:21
5

Since the performance difference is so huge, the problem can hardly be related to allocation. It is much more likely caused by the array access.

I disassembled the loop body of the functions:

TestMethod1:

IL_0011:  ldloc.0 
IL_0012:  ldloc.1 
IL_0013:  ldc.i4.4 
IL_0014:  mul 
IL_0015:  add 
IL_0016:  ldc.r4 32768.
IL_001b:  stind.r4 // <----------- This one
IL_001c:  ldloc.1 
IL_001d:  ldc.i4.1 
IL_001e:  add 
IL_001f:  stloc.1 
IL_0020:  ldloc.1 
IL_0021:  ldc.i4 12500000
IL_0026:  blt IL_0011

TestMethod2:

IL_0012:  ldloc.0 
IL_0013:  ldloc.1 
IL_0014:  ldc.r4 32768.
IL_0019:  stelem.r4 // <----------- This one
IL_001a:  ldloc.1 
IL_001b:  ldc.i4.1 
IL_001c:  add 
IL_001d:  stloc.1 
IL_001e:  ldloc.1 
IL_001f:  ldc.i4 12500000
IL_0024:  blt IL_0012

We can check the usage of these instructions and, more importantly, the exceptions they throw, in the ECMA spec:

stind.r4: Store value of type float32 into memory at address

Exception it throws:

System.NullReferenceException

And

stelem.r4: Replace array element at index with the float32 value on the stack.

Exceptions it throws:

System.NullReferenceException
System.IndexOutOfRangeException
System.ArrayTypeMismatchException

As you can see, stelem does more work, in array range checking and type checking. Since the loop body does very little (it only assigns a value), the overhead of this checking dominates the computation time. That is why the performance differs by 530%.

And this also answers your questions: the danger is the absence of array range & type checking. This is unsafe (as mentioned in the function declaration ;D).
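Note that safe code can often be written so that the JIT removes the range check itself; whether it actually happens depends on the JIT version, but the canonical pattern is to bound the loop by the array's own Length (a hypothetical variant of TestMethod2):

public static void FillSamples(float[] samples)
{
    // With samples.Length as the loop bound, the JIT can prove every
    // access is in range and may elide the per-iteration check.
    for (int i = 0; i < samples.Length; i++)
    {
        samples[i] = 32768;
    }
}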

HKTonyLee
4

EDIT: (a small change in the code and in the measuring produces a big change in the outcome)

Firstly, I ran the optimized code in the debugger (F5), but that was wrong. It should be run without the debugger (Ctrl+F5). Second, the code may be thoroughly optimized, so we must complicate it so that the optimizer does not mess with our measurements. I made all methods return the last item in the array, and the array is populated differently. Also, there was an extra zero in the OP's TestMethod2 (fixed in the question since) that always made it ten times slower.

I tried some other methods, in addition to the two that you provided. Method 3 has the same code as your method 2, but the function is declared unsafe. Method 4 uses pointer access to a regularly created array. Method 5 uses pointer access to unmanaged memory, as described by Marc Gravell. All five methods run in very similar times. M5 is the fastest (and M1 is a close second). The difference between the fastest and the slowest is some 5%, which is not something I would care about.

    public static unsafe float TestMethod3()
    {
        float[] samples = new float[5000000];

        for (var ii = 0; ii < 5000000; ii++)
        {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }

        return samples[5000000 - 1];
    }

    public static unsafe float TestMethod4()
    {
        float[] prev = new float[5000000];
        fixed (float* samples = &prev[0])
        {
            for (var ii = 0; ii < 5000000; ii++)
            {
                samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
            }

            return samples[5000000 - 1];
        }
    }

    public static unsafe float TestMethod5()
    {
        var ptr = Marshal.AllocHGlobal(5000000 * sizeof(float));
        try
        {
            float* samples = (float*)ptr;

            for (var ii = 0; ii < 5000000; ii++)
            {
                samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
            }

            return samples[5000000 - 1];
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }
Dialecticus
  • So M3 is the same as M2 only marked with "unsafe"? Rather suspicious that it would be any faster... are you sure? – Roman Starkov Jun 13 '14 at 11:01
  • @romkyns I've just run a benchmark (M2 vs M3), and surprisingly M3 is actually 2.14% faster than M2. – Sam Jun 13 '14 at 11:08
  • "*The conclusion is that using the stack is not needed.*" when allocating large blocks such as I gave in my post I agree, but, after having just completed some more benchmarks [M1 vs M2](https://gist.github.com/ArcticWinter/c1819b9a05e338fa403e#file-stack-vs-heap-benchmark) (using [PFM's idea](http://stackoverflow.com/a/24201079/2246344) for both methods) I would certainly have to disagree, as M1 is now 135% faster than M2. – Sam Jun 13 '14 at 11:51
  • 1
    @Sam But you're still comparing pointer access to array access! *THAT* is primarily what makes it faster. `TestMethod4` vs `TestMethod1` is a much better comparison for `stackalloc`. – Roman Starkov Jun 13 '14 at 12:02
  • @romkyns Ah yes, good point; I forgot about that. I've re-run the [benchmarks](https://gist.github.com/ArcticWinter/c1819b9a05e338fa403e#file-stack-vs-heap-benchmark), and there's only an 8% difference now (M1 being the faster of the two). – Sam Jun 13 '14 at 12:16
  • @Sam I changed the answer. Now there's almost no difference between all five tests. – Dialecticus Jun 13 '14 at 12:21