
It may be the case that my hardware is the culprit, but during testing, I've found that:

void SomeFunction(AType ofThing) {
    DoSomething(ofThing);
}

...is faster than:

private AType _ofThing;
void SomeFunction() {
    DoSomething(_ofThing);
}

I believe it has to do with how the compiler translates this to CIL. Could anyone explain, specifically, why this happens?

Here's some code where it happens:

public void TestMethod1()
{

    var stopwatch = new Stopwatch();

    var r = new int[] { 1, 2, 3, 4, 5 };
    var i = 0;
    stopwatch.Start();
    while (i < 1000000)
    {
        DoSomething(r);
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    i = 0;
    stopwatch.Restart();
    while (i < 1000000)
    {
        DoSomething();
        i++;
    }

    stopwatch.Stop();

    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

private void DoSomething(int[] arg1)
{
    var r = arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}

private int[] _arg1 = new [] { 1, 2, 3, 4, 5 };
private void DoSomething()
{
    var r = _arg1[0] * _arg1[1] * _arg1[2] * _arg1[3] * _arg1[4];
}

In my case it is 2.5x slower to use a private field.

6 Answers

9

I believe it has to do with how the compiler translates this to CIL.

Not really. Performance doesn't directly depend on the CIL code, because that's not what's actually executed. What's executed is the JITed native code, so you should look at that when you're interested in performance.

So, let's look at the code generated for the DoSomething(int[]) loop:

mov         eax,dword ptr [ebx+4] ; get the length of the array
cmp         eax,0       ; if it's 0
jbe         0000018C    ; jump to code that throws IndexOutOfRangeException
cmp         eax,1       ; if it's 1, etc.
jbe         0000018C 
cmp         eax,2 
jbe         0000018C 
cmp         eax,3 
jbe         0000018C 
cmp         eax,4 
jbe         0000018C 
inc         esi         ; i++
cmp         esi,0F4240h ; if i < 1000000
jl          000000B7    ; loop again

What's interesting about this code is that no useful work is done at all; most of the code is array bounds checking. (Why the checking hasn't been hoisted out of the loop and performed only once, I have no idea.)

Also notice that the code is inlined: you're not paying the cost of a function call.

This code takes around 1.7 ms on my computer.

So, what does the loop for DoSomething() look like?

mov         ecx,dword ptr [ebp-10h]  ; access this
call        dword ptr ds:[001637F4h] ; call DoSomething()
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          00000120                 ; loop again

Okay, so this actually calls the method; no inlining this time. What does the method itself look like?

mov         eax,dword ptr [ecx+4] ; access this._arg1
cmp         dword ptr [eax+4],0   ; if its length is 0
jbe         00000022 ; jump to code that throws IndexOutOfRangeException
cmp         dword ptr [eax+4],1   ; etc.
jbe         00000022 
cmp         dword ptr [eax+4],2 
jbe         00000022 
cmp         dword ptr [eax+4],3 
jbe         00000022 
cmp         dword ptr [eax+4],4 
jbe         00000022 
ret                               ; bounds checks successful, return

Comparing with the previous version (and ignoring the overhead of the function call for now), this does three different memory accesses instead of just one, which could explain some of the performance difference. (I think the five accesses to eax+4 should be counted only as one, because otherwise the compiler would optimize them.)

This code runs in about 3.0 ms for me.

How much overhead does the method call take? We can check that by adding [MethodImpl(MethodImplOptions.NoInlining)] to the previously inlined DoSomething(int[]). The assembly now looks like this:

mov         ecx,dword ptr [ebp-10h]  ; access this
mov         edx,dword ptr [ebp-14h]  ; access r
call        dword ptr ds:[002937E8h] ; call DoSomething(int[])
inc         esi                      ; i++
cmp         esi,0F4240h              ; if i < 1000000
jl          000000A0                 ; loop again

Notice that r is no longer kept in a register; it's instead on the stack, which adds another slowdown.

Now DoSomething(int[]):

push        ebp                   ; save ebp from caller to stack
mov         ebp,esp               ; write our own ebp
mov         eax,dword ptr [edx+4] ; read the length of the array
cmp         eax,0    ; if it's 0
jbe         00000021 ; jump to code that throws IndexOutOfRangeException
cmp         eax,1    ; etc.
jbe         00000021 
cmp         eax,2 
jbe         00000021 
cmp         eax,3 
jbe         00000021 
cmp         eax,4 
jbe         00000021 
pop         ebp      ; restore ebp
ret                  ; return

This code runs in about 3.2 ms for me. That's even slower than DoSomething(). What's going on?

Turns out, [MethodImpl(MethodImplOptions.NoInlining)] seems to cause those unnecessary ebp instructions. If I add that attribute to DoSomething(), it runs in 3.3 ms.
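For reference, forcing the call this way just means annotating the method; the body below is the one from the question:

```csharp
using System.Runtime.CompilerServices;

// Forces the JIT to emit an actual call instead of inlining the body.
[MethodImpl(MethodImplOptions.NoInlining)]
private void DoSomething(int[] arg1)
{
    var r = arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
}
```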

This means the difference between stack access and heap access is pretty small (but still measurable). The fact that the array pointer could be kept in a register when the method was inlined was probably more significant.


So, the conclusion is that the big difference you're seeing is caused by inlining. The JIT compiler decided to inline the code for DoSomething(int[]), but not for DoSomething(), which allowed the code for DoSomething(int[]) to be very efficient. The most likely reason is that the IL for DoSomething() is much longer (21 bytes vs. 46 bytes).

Also, you're not really measuring what you wrote (array accesses and multiplications), because the result is unused and could be optimized out. So be careful when devising your microbenchmarks: make sure the compiler can't ignore the code you actually wanted to measure.
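One hedged way to do that (a sketch only, assuming you change DoSomething to return its result rather than discard it):

```csharp
using System;
using System.Diagnostics;

class Benchmark
{
    // Hypothetical variant of the question's method that returns its result.
    static int DoSomething(int[] arg1)
    {
        return arg1[0] * arg1[1] * arg1[2] * arg1[3] * arg1[4];
    }

    static void Main()
    {
        var r = new[] { 1, 2, 3, 4, 5 };
        long total = 0;

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < 1000000; i++)
        {
            // Accumulating the return value keeps the work observable.
            total += DoSomething(r);
        }
        stopwatch.Stop();

        // Printing the total makes the result "used", so the JIT can't
        // treat the multiplications as dead code.
        Console.WriteLine(stopwatch.ElapsedMilliseconds + " ms, total = " + total);
    }
}
```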

svick
  • Out of curiosity, which JIT compiler did you use? – kvb Nov 19 '13 at 23:05
  • @kvb I have .Net 4.5 and I didn't change any settings in the project (apart from switching to Release mode). – svick Nov 19 '13 at 23:12
  • x86, I assume? Do you see similar perf differences on x64? – kvb Nov 20 '13 at 16:04
  • @kvb Like I said, I didn't change any settings, which means Any CPU 32-bit preferred. So, yeah, this was x86. I didn't check x64. – svick Nov 20 '13 at 16:07
5

Several people have made a stack/heap distinction, but this is a false dichotomy; when the IL is compiled to machine code there are additional possibilities, such as passing arguments in registers, which is potentially even faster than getting them off of the stack. See Eric Lippert's great blog post The Truth About Value Types for more thoughts along these lines. In any case, a proper analysis of the performance difference will almost certainly require looking at the generated machine code, not at the IL, and will potentially depend on the version of the JIT compiler, etc.

kvb
3

If that is your example, I would not be surprised to see that SomeFunction is being inlined. See here.

It is also entirely possible that the JIT isn't able to inline the second example.

You would need to look at the compiled code to prove this. I don't know of a deterministic way to tell whether something is inlined other than looking at the compiled code.

You could at least rule out caching by having another thread write to _ofThing: if you get similar results while the read value is being changed, then caching isn't the explanation.
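A rough sketch of that experiment (the field type and the writer loop here are assumptions, adapted to the int[] from the question):

```csharp
using System.Threading;

class CacheProbe
{
    private volatile bool _running = true;
    private int[] _ofThing = { 1, 2, 3, 4, 5 };

    public void Run()
    {
        // A writer thread keeps replacing the field while the timed loop
        // reads it, so a cached value can't silently explain the timings.
        var writer = new Thread(() =>
        {
            var n = 0;
            while (_running)
            {
                _ofThing = new[] { n, n + 1, n + 2, n + 3, n + 4 };
                n++;
            }
        });
        writer.Start();

        // ... run the timed loop that reads _ofThing here ...

        _running = false;
        writer.Join();
    }
}
```

If the timings stay roughly the same while the field is being churned, caching of the value isn't the explanation.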

Meirion Hughes
2

Even if function is not inlined, referencing an arg can be faster because of cache locality: the arg is already in CPU cache.

It's worth noting that you put it in the cache by calling this function, so you've already paid that price.

Andriy Tylychko
  • your point re having already paid the price is a very good one (it's a zero sum game). It ought to be possible for the OP to test this. – TooTone Nov 19 '13 at 15:29
  • That shouldn't cause the difference here; for at least 999999 iterations, the array will be cached in both versions. – svick Nov 19 '13 at 21:16
0

This is entirely a matter of where your variable is stored: on the stack or on the heap. For example, the following code is much faster because it uses a static variable:

private static AType _ofThing;
void SomeFunction() {
    DoSomething(_ofThing);
}

For more information about where variables are stored, please have a look at this excellent answer from Hans Passant

Moslem Ben Dhaou
  • Thanks for your comment, but this is not faster. – AStackOverflowUser Nov 19 '13 at 14:09
  • Why should accessing a variable on the stack be any faster than accessing a variable on the heap? – svick Nov 19 '13 at 15:00
  • Also, where are you saying that `static` variable is stored? (It's certainly not on the stack, and it's also not part of the normal managed heap.) – svick Nov 19 '13 at 15:07
  • Static variable is stored on the heap. This heap is separate from the normal garbage collected heap - it's known as a "high frequency heap", and there's one per application domain. Have a look here http://www.codeproject.com/Articles/15269/Static-Keyword-Demystified – Moslem Ben Dhaou Nov 21 '13 at 12:04
  • http://stackoverflow.com/questions/337019/hows-memory-allocated-for-a-static-variable – Moslem Ben Dhaou Nov 21 '13 at 12:04
-1

When you call a method using a parameter, you are using stack memory, and when you use a global variable you are using heap memory.

Stack

  • very fast access
  • don't have to explicitly de-allocate variables
  • space is managed efficiently by the CPU; memory will not become fragmented
  • local variables only
  • limit on stack size (OS-dependent)
  • variables cannot be resized

Heap

  • variables can be accessed globally
  • no limit on memory size
  • (relatively) slower access
  • no guaranteed efficient use of space, memory may become fragmented over time as blocks of memory are allocated, then freed
  • variables can be resized

http://tutorials.csharp-online.net/Stack_vs._Heap

Andrew Paes
  • Be careful using the term "global variable"... it's technically untrue. You have static member variables of classes. – Meirion Hughes Nov 19 '13 at 14:13
  • Ok, you mention some differences between stack and heap in native code (several of those are not valid in .Net). How does that answer the question? Why is access to heap slower than to stack? – svick Nov 19 '13 at 15:05
  • @svick Agreed. I feel this answer is in danger of being quite misleading. The only thing about the stack that is likely to make it faster compared to the heap is **locality of reference**. I.e., because the stack is densely populated with variables you're looking at right now, it's much more likely those variables will be in cache (if they're not passed in a register as another answer suggests). – TooTone Nov 19 '13 at 15:35
  • @TooTone That's probably not relevant here, because that part of the heap will be loaded into the cache on the first iteration and it will stay there. But there's another reason: accessing the heap usually means one more pointer dereference. I plan to elaborate on this in an answer of my own when I get home. – svick Nov 19 '13 at 15:39
  • @svick Sorry I was referring to the original question without the loop. In any case I look forward to seeing your answer later! – TooTone Nov 19 '13 at 16:20