14

I was writing a few benchmarking tests to figure out why a similar pure algorithm (no C++ lib / .net built in classes) ran much faster in C++ than in C#, even when accounting for the expected feature différences. And while doing so i stumbled on these 2 tests that baffled me, does anyone have an idea about why one is substantially slower thant he other? The only difference in the 2nd one (that takes 51ms vs 88 on my machine) is that the 2 arrays are declared locally in the method instead of outside. In both cases the arrays are created before we start timing.

    const int Runs = 100;
    const int Width = 5000;
    const int Height = 5000;
    const int Size = Width * Height;


    static int[] Input = Enumerable.Range(0, Size).ToArray();
    static int[] Output = new int[Size * 2];

    static int SimpleTest()
    {
        // Removing those 2 lines and using the static arrays instead give substantially slower performance, nearly half the speed!
        int[] Input = Enumerable.Range(0, Size).ToArray();
        int[] Output = new int[Size * 2];

        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int run = 0; run < Runs; run++)
        {
            int InputIndex = 0;
            for (int x = 0; x < Width; x++)
            {
                for (int y = 0; y < Height; y++)
                {
                    int pixel = Input[InputIndex];
                    var OutputIndex = InputIndex * 2;
                    Output[OutputIndex] = pixel;
                    Output[OutputIndex + 1] = pixel;
                    InputIndex++;
                }
            }
        }
        sw.Stop();
        return (int)(sw.ElapsedMilliseconds / Runs);
    }
Ronan Thibaudau
  • 3,413
  • 3
  • 29
  • 78
  • 1
    Interesting find... I'm sure there is a reasonable explanation. I'm thinking that it has to do with the way you're accessing those variables. While inside the method, they're considered to be local to that method, so I'm thinking that the access speed is a lot faster than having to access the global variables. Someone like @JonSkeet, who's familiar with the internals more than any mortal, would probably have an explanation. – B.K. May 13 '15 at 05:03
  • Could it be because of local and global scope thing? I mean when code is stored in memory, global variables are placed in different location than local variables for a function. – Abhishek May 13 '15 at 05:09
  • @Abhishek Could be... I believe that static variables are stored on the heap, but those variables inside the static method... are they stored on the stack or heap? – B.K. May 13 '15 at 05:10
  • @B.K. upto my knowledge it is stack – Abhishek May 13 '15 at 05:20
  • This is a 2.5 million elements array, i'm pretty sure it's not stored on the stack – Ronan Thibaudau May 13 '15 at 05:22
  • @RonanThibaudau I'm not saying values will be stored in stack. I'm saying reference of the variables :) – Abhishek May 13 '15 at 05:23

3 Answers3

15

When the variables are local, the compiler knows that Input and Output will never change, which opens up many optimizations.

  • The value of the Input and Output variables can be kept in registers.
  • Input.Length and Output.Length can be calculated once and cached.
  • The compiler can prove that Input[InputIndex] and Output[OutputIndex] will never result in an array index out of bounds, so the bounds check can be optimized out.
  • The compiler can observe that the results of Input and Output are never used, so it could optimize the loops to nothing!

If you use the static versions, then the compiler cannot perform these optimizations. The compiler must reload Input and Output at each access, and must perform a bounds check at every array index operation, just in case another thread modified Input or Output.

For example, if another thread does Input = new int[Size], then all future calculations must proceed with this alternate Input. And if another thread does Output = new int[1], then the code must raise an IndexOutOfRangeException.

Raymond Chen
  • 44,448
  • 11
  • 96
  • 135
  • Hmmm those are fair points but i don't think any of those apply – Ronan Thibaudau May 13 '15 at 05:19
  • 1
    Input and output are fairly large arrays, so they still have to be in memory. Neither input nor output . Length are getting used there. The bounds check can't be excluded because as far as i know the C# compiler only does it when you explcitely use .lenght (i don't here). And it could optimize the loops away but it clearly doesn't (else we would have 0MS, not 50 instead of 80) – Ronan Thibaudau May 13 '15 at 05:20
  • 2
    I'm seeing the 64-bit JIT cache `Input` in the `r12` register and `Output` in the `rsi` register. It also hard-codes the values of `Input.Length` and `Output.Length` (e.g. `cmp rax, 2faf080h`) to optimize the bounds checks. – Raymond Chen May 13 '15 at 05:38
  • Ah so it still compares them with my index (instead of with index.length) while bound checking but doesn't remove bounds check completely (unlike if i was doing x – Ronan Thibaudau May 13 '15 at 05:41
  • 3
    My original answer was describing optimizations available to the compiler. Which ones it actually uses was a matter of experimentation. The optimizations in practice may vary with your compiler options or which version of the compiler you're using. Note also that the compiler disables certain optimization when you have a debugger attached. – Raymond Chen May 13 '15 at 05:47
  • Any reason why the results in my answer (posted elsewhere on this page), are the opposite, where static is faster? – Nic Foster May 13 '15 at 05:52
  • 2
    @NicFoster You tested with Mono. I tested with Microsoft's compiler. By the way, here's [a blog entry on array bounds check optimizations](http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx). The section on "Arrays in global locations" seems relevant here. – Raymond Chen May 13 '15 at 05:56
  • Under ECMA the JIT is allowed to enregister the statics in any case. The threading model is extremely weak under ECMA. It's a JIT limitation that it doesn't do this. – usr May 13 '15 at 13:06
5

With the 32-bit JIT I believe the culprit is, as Raymond Chen mentionned, that Input and Output can be kept in registers when they are locals, but they need to be reloaded every time when they are not. Generated assembly:

For locals:

007426F0  mov         eax,dword ptr [ebp-18h]  
007426F3  mov         edi,dword ptr [eax+4]  
                        int pixel = Input[InputIndex];
007426F6  mov         eax,dword ptr [ebp-18h]  
007426F9  cmp         edx,edi  
007426FB  jae         0074276E  
007426FD  mov         ecx,dword ptr [eax+edx*4+8] 

For statics:

011C2718  mov         dword ptr [ebp-18h],edx  
011C271B  mov         esi,dword ptr ds:[3BB7E90h]  
011C2721  mov         eax,dword ptr [esi+4]  
011C2724  mov         dword ptr [ebp-1Ch],eax  
                        int pixel = Input[InputIndex];
011C2727  mov         eax,dword ptr [ebp-1Ch]  
011C272A  cmp         ecx,eax  
011C272C  jae         011C27A2  
011C272E  mov         edi,dword ptr [esi+ecx*4+8] 

As you can see, mov esi,dword ptr ds:[3BB7E90h] accesses the data segment. As you can also see, bounds checking occur in both cases (cmp-jae) so that's irrelevant, and the loops are not in fact optimized to nothing.

How the 64-bit JIT avoids the issue is beyond me.

Here's the complete dissassembly for both cases:

Fast version:

               for (int x = 0; x < Width; x++) {
007426EB  mov         dword ptr [ebp-14h],edx  
                    for (int y = 0; y < Height; y++) {
007426EE  xor         ebx,ebx  
007426F0  mov         eax,dword ptr [ebp-18h]  
007426F3  mov         edi,dword ptr [eax+4]  
                        int pixel = Input[InputIndex];
007426F6  mov         eax,dword ptr [ebp-18h]  
007426F9  cmp         edx,edi  
007426FB  jae         0074276E  
007426FD  mov         ecx,dword ptr [eax+edx*4+8]  
                        var OutputIndex = InputIndex * 2;
00742701  mov         esi,edx  
00742703  add         esi,esi  
                        Output[OutputIndex] = pixel;
00742705  mov         eax,dword ptr [ebp-1Ch]  
00742708  cmp         esi,dword ptr [eax+4]  
0074270B  jae         0074276E  
0074270D  mov         dword ptr [eax+esi*4+8],ecx  
                        Output[OutputIndex + 1] = pixel;
00742711  inc         esi  
00742712  mov         eax,dword ptr [ebp-1Ch]  
00742715  cmp         esi,dword ptr [eax+4]  
00742718  jae         0074276E  
0074271A  mov         dword ptr [eax+esi*4+8],ecx  
                        InputIndex++;
0074271E  inc         edx  
                    for (int y = 0; y < Height; y++) {
0074271F  inc         ebx  
                    for (int y = 0; y < Height; y++) {
00742720  cmp         ebx,1388h  
00742726  jl          007426F6  
                for (int x = 0; x < Width; x++) {
00742728  inc         dword ptr [ebp-14h]  
0074272B  cmp         dword ptr [ebp-14h],1388h  
00742732  jl          007426EE 

Slow version:

                for (int x = 0; x < Width; x++) {
011C2713  mov         dword ptr [ebp-14h],ecx  
                    for (int y = 0; y < Height; y++) {
011C2716  xor         edx,edx  
011C2718  mov         dword ptr [ebp-18h],edx  
011C271B  mov         esi,dword ptr ds:[3BB7E90h]  
011C2721  mov         eax,dword ptr [esi+4]  
011C2724  mov         dword ptr [ebp-1Ch],eax  
                        int pixel = Input[InputIndex];
011C2727  mov         eax,dword ptr [ebp-1Ch]  
011C272A  cmp         ecx,eax  
011C272C  jae         011C27A2  
011C272E  mov         edi,dword ptr [esi+ecx*4+8]  
                        var OutputIndex = InputIndex * 2;
011C2732  mov         ebx,ecx  
011C2734  add         ebx,ebx  
                        Output[OutputIndex] = pixel;
011C2736  mov         edx,dword ptr ds:[3BB7E94h]  
011C273C  cmp         ebx,dword ptr [edx+4]  
011C273F  jae         011C27A2  
011C2741  mov         dword ptr [edx+ebx*4+8],edi  
                        Output[OutputIndex + 1] = pixel;
011C2745  inc         ebx  
011C2746  cmp         ebx,dword ptr [edx+4]  
011C2749  jae         011C27A2  
011C274B  mov         dword ptr [edx+ebx*4+8],edi  
                        InputIndex++;
011C274F  inc         ecx  
                    for (int y = 0; y < Height; y++) {
011C2750  inc         dword ptr [ebp-18h]  
011C2753  cmp         dword ptr [ebp-18h],1388h  
011C275A  jl          011C2727  
                for (int x = 0; x < Width; x++) {
011C275C  inc         dword ptr [ebp-14h]  
011C275F  cmp         dword ptr [ebp-14h],1388h  
011C2766  jl          011C2716  
Asik
  • 21,506
  • 6
  • 72
  • 131
2

I wonder if this could be similar to the performance difference between static and member functions? Static method calls are not null checked, whereas instance function calls are null checked.

Also, I'm seeing different results than you are. The static arrays have a shorter run time on my machine. 64.x ms per run versus about 75.x ms.

Here's the complete program I used. Running on Mono C#, on OSX.

using System;
using System.Diagnostics;
using System.Linq;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Time per run was: " + SimpleTest());
        }

        const int Runs = 10;
        const int Width = 5000;
        const int Height = 5000;
        const int Size = Width * Height;


        static int[] Input = Enumerable.Range(0, Size).ToArray();
        static int[] Output = new int[Size * 2];

        static float SimpleTest()
        {
            // Removing those 2 lines and using the static arrays instead give substantially slower performance, nearly half the speed!
            //int[] Input = Enumerable.Range(0, Size).ToArray();
            //int[] Output = new int[Size * 2];

            Stopwatch sw = new Stopwatch();
            sw.Start();

            for (int run = 0; run < Runs; run++)
            {
                int InputIndex = 0;
                for (int x = 0; x < Width; x++)
                {
                    for (int y = 0; y < Height; y++)
                    {
                        int pixel = Input[InputIndex];
                        var OutputIndex = InputIndex * 2;
                        Output[OutputIndex] = pixel;
                        Output[OutputIndex + 1] = pixel;
                        InputIndex++;
                    }
                }
            }
            sw.Stop();
            return (sw.ElapsedMilliseconds / (float)Runs);
        }
    }
}
Community
  • 1
  • 1
Nic Foster
  • 2,864
  • 1
  • 27
  • 45
  • I'm really interested in the reason why it is so in my case, even if the mono implementation does the reverse. – Ronan Thibaudau May 13 '15 at 05:22
  • @RonanThibaudau, it may be hardware specific, OS-specific, or maybe related to Mono. We'd need more people to run the code so we could see what the differences are. – Nic Foster May 13 '15 at 05:25
  • I really doubt it would be hardware specific, but it's not surprising 2 diferent (.net vs mono) implémentations behave differently. I'm really interested in why there is a speed difference on .net, since there's hardly a speed diference on mono i think it's a non issue there – Ronan Thibaudau May 13 '15 at 05:27
  • What about cache sizes on various CPUs? Is it possible the memory allocated for the static arrays lies elsewhere from the locally created array? In the case of using the static arrays, more of the data being modified was allocated outside of the function and lies instead with the class. Whereas with the local arrays the arrays themselves are in one segment of memory and allocated when the function is called, but Runs, Width, and Height are located outside of the function. – Nic Foster May 13 '15 at 05:30
  • I just tried moving all variables into the scope of the function, no performance difference, static version was still about 15% faster for me. – Nic Foster May 13 '15 at 05:32