16

We are looking to migrate a performance-critical application to .NET, and we find that the C# version is 30% to 100% slower than the Win32/C version, depending on the processor (the difference is more marked on a mobile T7200 processor). I have a very simple sample of code that demonstrates this. For brevity I shall just show the C version; the C# is a direct translation (a reconstruction is sketched after the listing):

#include "stdafx.h"
#include "Windows.h"

int array1[100000];
int array2[100000];

int Test();

int main(int argc, char* argv[])
{
    int res = Test();

    return 0;
}

int Test()
{
    int calc,i,k;
    calc = 0;

    /* Fill the arrays with overlapping sequences of values. */
    for (i = 0; i < 50000; i++) array1[i] = i + 2;

    for (i = 0; i < 50000; i++) array2[i] = 2 * i - 2;

    /* O(n^2) comparison loop: 2.5 billion iterations in total. */
    for (i = 0; i < 50000; i++)
    {
        for (k = 0; k < 50000; k++)
        {
            if (array1[i] == array2[k]) calc = calc - array2[i] + array1[k];
            else calc = calc + array1[i] - array2[k];
        }
    }
    return calc;
}
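
For reference, a reconstruction of the direct C# translation (the original post omits it for brevity; the disassembly below suggests the OP's real code used different array names, pev_tmp and gat_tmp):

class Program
{
    static int[] array1 = new int[100000];
    static int[] array2 = new int[100000];

    static void Main()
    {
        int res = Test();
    }

    static int Test()
    {
        // Declared at method scope, mirroring the C version exactly.
        int calc, i, k;
        calc = 0;

        for (i = 0; i < 50000; i++) array1[i] = i + 2;

        for (i = 0; i < 50000; i++) array2[i] = 2 * i - 2;

        for (i = 0; i < 50000; i++)
        {
            for (k = 0; k < 50000; k++)
            {
                if (array1[i] == array2[k]) calc = calc - array2[i] + array1[k];
                else calc = calc + array1[i] - array2[k];
            }
        }
        return calc;
    }
}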

If we look at the disassembly in Win32 for the 'else' we have:

35:               else calc = calc + array1[i] - array2[k]; 
004011A0   jmp         Test+0FCh (004011bc)
004011A2   mov         eax,dword ptr [ebp-8]
004011A5   mov         ecx,dword ptr [ebp-4]
004011A8   add         ecx,dword ptr [eax*4+48DA70h]
004011AF   mov         edx,dword ptr [ebp-0Ch]
004011B2   sub         ecx,dword ptr [edx*4+42BFF0h]
004011B9   mov         dword ptr [ebp-4],ecx

(This is a debug build, but bear with me.)

The disassembly for the C# version, taken with the CLR debugger on the optimised exe:

                    else calc = calc + pev_tmp[i] - gat_tmp[k];
000000a7  mov         eax,dword ptr [ebp-4] 
000000aa  mov         edx,dword ptr [ebp-8] 
000000ad  mov         ecx,dword ptr [ebp-10h] 
000000b0  mov         ecx,dword ptr [ecx] 
000000b2  cmp         edx,dword ptr [ecx+4] 
000000b5  jb          000000BC 
000000b7  call        792BC16C 
000000bc  add         eax,dword ptr [ecx+edx*4+8]
000000c0  mov         edx,dword ptr [ebp-0Ch] 
000000c3  mov         ecx,dword ptr [ebp-14h] 
000000c6  mov         ecx,dword ptr [ecx] 
000000c8  cmp         edx,dword ptr [ecx+4]
000000cb  jb          000000D2 
000000cd  call        792BC16C 
000000d2  sub         eax,dword ptr [ecx+edx*4+8] 
000000d6  mov         dword ptr [ebp-4],eax 

Many more instructions, presumably the cause of the performance difference.

So 3 questions really:

  1. Am I looking at the correct disassembly for the two programs, or are the tools misleading me?

  2. If the difference in the number of generated instructions is not the cause of the difference what is?

  3. What can we possibly do about it, other than keep all our performance-critical code in a native DLL?

Thanks in advance, Steve

PS I did receive an invite recently to a joint MS/Intel seminar entitled something like 'Building performance critical native applications'. Hmm...

  • Could you remove all the newlines between the assembly instructions. – Wadih M. Jun 29 '09 at 19:30
  • As always, profile it to see exactly what costs the most performance hit. (There's no way we can see what takes the time in your code, so no point in asking us. Ask a profiler instead) Apart from that, a simple trick might be to run your C# code through NGen. That should boost performance quite a bit. – jalf Jun 29 '09 at 19:33
  • Which version of the CLR are you comparing to? As far as I know, the .NET 3.5 SP1 JIT compiler is more efficient than the old ones. Also, the x64 JIT optimizer is more aggressive than the x86 one. – Mehrdad Afshari Jun 29 '09 at 19:34
  • By the way, the "direct" C# translation is important. And are you sure you're checking the JIT generated assembly with optimization enabled? – Mehrdad Afshari Jun 29 '09 at 19:36
  • See also this related question: http://stackoverflow.com/questions/883642/why-would-i-see-20-speed-increase-using-native-code – Dirk Vollmar Jun 29 '09 at 20:34

7 Answers

18

I believe your main issue in this code is going to be bounds checking on your arrays.

If you switch to using unsafe code in C# and use pointer math, you should be able to achieve the same (or potentially faster) performance.

This same issue was previously discussed in detail in this question.
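
For illustration, a minimal sketch of the unsafe/pointer version (my code, not from the answer; requires compiling with /unsafe):

class Program
{
    static int[] array1 = new int[100000];
    static int[] array2 = new int[100000];

    static unsafe int TestUnsafe()
    {
        int calc = 0;

        for (int i = 0; i < 50000; i++) array1[i] = i + 2;
        for (int i = 0; i < 50000; i++) array2[i] = 2 * i - 2;

        // Pinning the arrays and indexing through raw pointers means the
        // JIT emits no per-access range checks in the inner loop.
        fixed (int* p1 = array1, p2 = array2)
        {
            for (int i = 0; i < 50000; i++)
            {
                for (int k = 0; k < 50000; k++)
                {
                    if (p1[i] == p2[k]) calc = calc - p2[i] + p1[k];
                    else calc = calc + p1[i] - p2[k];
                }
            }
        }
        return calc;
    }
}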

– Reed Copsey
13

I believe you are seeing the results of bounds checks on the arrays. You can avoid the bounds checks by using unsafe code.

I believe the JITer can recognize patterns like for loops that iterate up to array.Length and elide the bounds check, but it doesn't look like your code can take advantage of that; see the sketch below.
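
For illustration, a minimal sketch (my code, not from the answer) of the pattern the JIT can recognise:

static int Sum(int[] data)
{
    int sum = 0;
    // The loop bound is the array's own Length, so the JIT can prove
    // every access is in range and elide the per-iteration bounds check.
    for (int i = 0; i < data.Length; i++)
    {
        sum += data[i];
    }
    return sum;
}

The question's loops use a literal bound (50000), and the inner loop indexes a different array from the one named in its condition, so the JIT of that era cannot prove the accesses are in range.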

– Michael
  • I see a lot of these apples-oranges "identical code" attempts at perf comparisons with toy code. Yet I never see a negative comparison with full, product-quality code of comparable quality. Maybe because C# isn't actually slower. – Greg D Jun 29 '09 at 19:37
  • @Greg D: I agree. I work almost exclusively on high performance, scientific oriented numerical processing. C# does have a very different perf profile than C++, though, so profiling is critical - but in general, you can get C# to be just as fast as C++ with the right profiling and adjustments to the code. – Reed Copsey Jun 29 '09 at 19:42
  • @Greg, Reed - Most of the issues that I see with managed code performance aren't around CPU time like this, but things like load time and memory footprint. For these, C++ still has a huge advantage (though bad programmers can easily negate that advantage :) – Michael Jun 29 '09 at 19:45
  • @Michael: True. Startup time, in particular, tends to suffer in a managed world. Memory limits on 32-bit are another issue where managed doesn't always live up to native (managed typically caps at 1.2-1.4 GB/process, although the compacting GC can make up for this in most cases). – Reed Copsey Jun 29 '09 at 19:47
6

As others have said, one of the aspects is bounds checking. There's also some redundancy in your code in terms of array access. I've managed to improve the performance somewhat by changing the inner block to:

// Read each element once and reuse the locals in both branches.
int tmp1 = array1[i];
int tmp2 = array2[k];
if (tmp1 == tmp2)
{
    calc = calc - array2[i] + array1[k];
}
else
{
    calc = calc + tmp1 - tmp2;
}

That change knocked the total time down from ~8.8s to ~5s.

– Jon Skeet
  • @Jon: Maybe I'm missing something, but I cannot measure any significant performance difference between your version and the OP's version. In fact, I also would not expect such a rather minimal change to have such an impact on performance. – Dirk Vollmar Jun 29 '09 at 20:31
  • Neither would I particularly, but it certainly does for me, on both .NET 3.5 and 4.0b1. Compiled with /o+ /debug- on 32 bit Vista as a console app. I've also changed the scope of the i and k variables, but I doubt that that's significant. – Jon Skeet Jun 29 '09 at 20:58
  • (I've tested it enough times to make sure it's not just a fluke, btw :) – Jon Skeet Jun 29 '09 at 20:58
  • @Jon: "I've also changed the scope of the i and k variables, but I doubt that that's significant." I checked this and it seems that a limited scope for i and k is actually the reason for the performance improvement. If i and k are just local to the for-loop the optimizer is probably able to remove the bounds check as it can determine that i and k are always within the bounds of the array (I checked this on XP/.NET 3.5). – Dirk Vollmar Jun 30 '09 at 07:38
  • It's not *just* that though - when I first just changed the scope, it made no difference - making the change specified in the answer then made a huge difference. I guess it's the two things combined. – Jon Skeet Jun 30 '09 at 08:29
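
For reference, a minimal sketch (an assumption of the combined version, not Jon's exact benchmark code) applying both changes from the thread above: loop variables scoped to their for statements plus the hoisted locals:

static int Test()
{
    int calc = 0;

    // Scoping i and k to the for statements helps the JIT prove the
    // indices stay within the arrays' bounds.
    for (int i = 0; i < 50000; i++) array1[i] = i + 2;
    for (int i = 0; i < 50000; i++) array2[i] = 2 * i - 2;

    for (int i = 0; i < 50000; i++)
    {
        for (int k = 0; k < 50000; k++)
        {
            // Hoisting the reads means each element is loaded once.
            int tmp1 = array1[i];
            int tmp2 = array2[k];
            if (tmp1 == tmp2) calc = calc - array2[i] + array1[k];
            else calc = calc + tmp1 - tmp2;
        }
    }
    return calc;
}
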
4

Just for fun, I tried building this in C# in Visual Studio 2010, and took a look at the JITed disassembly:

                    else 
                        calc = calc + array1[i] - array2[k];
000000cf  mov         eax,dword ptr [ebp-10h] 
000000d2  add         eax,dword ptr [ebp-14h] 
000000d5  sub         eax,edx 
000000d7  mov         dword ptr [ebp-10h],eax 

They made a number of improvements to the jitter in CLR 4.0; note that the bounds checks (the cmp/call pairs above) are no longer emitted.

2

C# is doing bounds checking on the array accesses.

If you run the calculation part in C# unsafe code, does it perform as well as the native implementation?

– SQLMenace
1

If your application's performance-critical path consists entirely of unchecked array processing, I'd advise you not to rewrite it in C#.

But then, if your application already works fine in language X, I'd advise you not to rewrite it in language Y.

What do you want to achieve from the rewrite? At the very least, give serious consideration to a mixed-language solution, using your already-debugged C code for the high-performance sections and using C# to get a nice user interface or convenient integration with the latest rich .NET libraries.

A longer answer on a possibly related theme.
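
As a concrete illustration of the mixed-language suggestion above, here is a minimal sketch; "PerfCore.dll" and its export are hypothetical names, assuming the existing C code is built into a native DLL:

using System;
using System.Runtime.InteropServices;

static class NativeMath
{
    // Hypothetical export: assumes the C Test() function is compiled into
    // PerfCore.dll and exported as extern "C" int Test(void).
    [DllImport("PerfCore.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int Test();
}

class Program
{
    static void Main()
    {
        int calc = NativeMath.Test();   // hot loop runs at native speed
        Console.WriteLine(calc);
    }
}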

– Daniel Earwicker
0

I am sure the optimizations for C are different from those for C#. Also, you have to expect at least a little performance slowdown; .NET adds another layer to the application with the framework.

The trade-off is more rapid development and huge libraries, in exchange for (what should be) a small loss of speed.

– bdwakefield