Why is C# twice as slow as C++ even though the generated machine code is nearly identical?

Question

This code was generated by .NET Core 3.0 JIT, for my manually vectorized C# code:

00007FFE6C7D2103  vmovdqu     xmm5,xmmword ptr [rcx]  
00007FFE6C7D2107  vmovdqu     xmm6,xmmword ptr [rcx+10h]  
00007FFE6C7D210C  vmovdqu     xmm7,xmmword ptr [rcx+20h]  
00007FFE6C7D2111  vmovdqu     xmm8,xmmword ptr [rcx+30h]  
00007FFE6C7D2116  vpand       xmm9,xmm5,xmm0  
00007FFE6C7D211A  vpand       xmm10,xmm6,xmm0  
00007FFE6C7D211E  vpackusdw   xmm9,xmm9,xmm10  
00007FFE6C7D2123  vpslldq     xmm9,xmm9,1  
00007FFE6C7D2129  vpand       xmm10,xmm5,xmm1  
00007FFE6C7D212D  vpand       xmm11,xmm6,xmm1  
00007FFE6C7D2131  vpackusdw   xmm10,xmm10,xmm11  
00007FFE6C7D2136  vpsrldq     xmm5,xmm5,1  
00007FFE6C7D213B  vpsrldq     xmm6,xmm6,1  
00007FFE6C7D2140  vpand       xmm5,xmm5,xmm1  
00007FFE6C7D2144  vpand       xmm6,xmm6,xmm1  
00007FFE6C7D2148  vpackusdw   xmm5,xmm5,xmm6  
                var low = brightness( r, g, b, redMul, greenMul, blueMul );
00007FFE6C7D214D  vpmulhuw    xmm9,xmm9,xmm2  
00007FFE6C7D2151  vpmulhuw    xmm10,xmm10,xmm3  
00007FFE6C7D2155  vpmulhuw    xmm5,xmm5,xmm4  
00007FFE6C7D2159  vpaddusw    xmm6,xmm9,xmm10  
00007FFE6C7D215E  vpaddusw    xmm5,xmm6,xmm5  
00007FFE6C7D2162  vpsrlw      xmm5,xmm5,8  
00007FFE6C7D2167  vpand       xmm6,xmm7,xmm0  
00007FFE6C7D216B  vpand       xmm9,xmm8,xmm0  
00007FFE6C7D216F  vpackusdw   xmm6,xmm6,xmm9  
00007FFE6C7D2174  vpslldq     xmm9,xmm6,1  
00007FFE6C7D2179  vpand       xmm6,xmm7,xmm1  
00007FFE6C7D217D  vpand       xmm10,xmm8,xmm1  
00007FFE6C7D2181  vpackusdw   xmm10,xmm6,xmm10  
00007FFE6C7D2186  vpsrldq     xmm6,xmm7,1  
00007FFE6C7D218B  vpsrldq     xmm7,xmm8,1  
00007FFE6C7D2191  vpand       xmm6,xmm6,xmm1  
00007FFE6C7D2195  vpand       xmm7,xmm7,xmm1  
00007FFE6C7D2199  vpackusdw   xmm6,xmm6,xmm7  
                var hi = brightness( r, g, b, redMul, greenMul, blueMul );
00007FFE6C7D219E  vpmulhuw    xmm7,xmm9,xmm2  
00007FFE6C7D21A2  vpmulhuw    xmm8,xmm10,xmm3  
00007FFE6C7D21A6  vpmulhuw    xmm6,xmm6,xmm4  
00007FFE6C7D21AA  vpaddusw    xmm7,xmm7,xmm8  
00007FFE6C7D21AF  vpaddusw    xmm6,xmm7,xmm6  
00007FFE6C7D21B3  vpsrlw      xmm6,xmm6,8  
00007FFE6C7D21B8  vpackuswb   xmm5,xmm5,xmm6  
                Sse2.Store( dst, bytes );
00007FFE6C7D21BC  vmovdqu     xmmword ptr [rdx],xmm5  

                src += 64;
00007FFE6C7D21C0  add         rcx,40h  
                dst += 16;
00007FFE6C7D21C4  add         rdx,10h  
            while( src < srcEnd )
00007FFE6C7D21C8  cmp         rcx,rax  
00007FFE6C7D21CB  jb          00007FFE6C7D2103

This code was generated by VC++ 2015, when compiling my manually vectorized C++.

    {
        VecInteger r, g, b;

        loadRgb( src, r, g, b );
00007FF735AD11C0  vmovdqu     xmm6,xmmword ptr [rcx-10h]
00007FF735AD11C5  vmovdqu     xmm7,xmmword ptr [rcx-20h]

        loadRgb( src + 2, r, g, b );
00007FF735AD11CA  vmovdqu     xmm9,xmmword ptr [rcx]
00007FF735AD11CE  vmovdqu     xmm8,xmmword ptr [rcx+10h]
    {
        VecInteger r, g, b;

        loadRgb( src, r, g, b );
00007FF735AD11D3  vpand       xmm3,xmm10,xmm6  
00007FF735AD11D7  vpand       xmm1,xmm11,xmm6  
00007FF735AD11DB  vpand       xmm0,xmm11,xmm7  
00007FF735AD11DF  vpackusdw   xmm1,xmm0,xmm1  
00007FF735AD11E4  vpslldq     xmm2,xmm1,1  
        const auto low = brightness( r, g, b );
00007FF735AD11E9  vpmulhuw    xmm4,xmm2,xmm12  
00007FF735AD11EE  vpand       xmm0,xmm10,xmm7  
00007FF735AD11F2  vpackusdw   xmm1,xmm0,xmm3  
        const auto low = brightness( r, g, b );
00007FF735AD11F7  vpmulhuw    xmm2,xmm1,xmm13  
00007FF735AD11FC  vpaddusw    xmm5,xmm4,xmm2  
    {
        VecInteger r, g, b;

        loadRgb( src, r, g, b );
00007FF735AD1200  vpsrldq     xmm0,xmm6,1  
00007FF735AD1205  vpand       xmm3,xmm0,xmm10  
00007FF735AD120A  vpsrldq     xmm1,xmm7,1  
00007FF735AD120F  vpand       xmm2,xmm1,xmm10  
00007FF735AD1214  vpackusdw   xmm0,xmm2,xmm3  
        const auto low = brightness( r, g, b );
00007FF735AD1219  vpmulhuw    xmm3,xmm0,xmm14  
00007FF735AD121E  vpaddusw    xmm1,xmm5,xmm3  
00007FF735AD1222  vpsrlw      xmm6,xmm1,8  

        loadRgb( src + 2, r, g, b );
00007FF735AD1227  vpand       xmm2,xmm11,xmm8  
00007FF735AD122C  vpand       xmm0,xmm11,xmm9  
00007FF735AD1231  vpackusdw   xmm1,xmm0,xmm2  
00007FF735AD1236  vpslldq     xmm2,xmm1,1  
        const auto hi = brightness( r, g, b );
00007FF735AD123B  vpmulhuw    xmm4,xmm2,xmm12  

        loadRgb( src + 2, r, g, b );
00007FF735AD1240  vpand       xmm0,xmm10,xmm9  
00007FF735AD1245  vpand       xmm3,xmm10,xmm8  
00007FF735AD124A  vpackusdw   xmm1,xmm0,xmm3  
        const auto hi = brightness( r, g, b );
00007FF735AD124F  vpmulhuw    xmm2,xmm1,xmm13  
00007FF735AD1254  vpaddusw    xmm5,xmm4,xmm2  

        loadRgb( src + 2, r, g, b );
00007FF735AD1258  vpsrldq     xmm1,xmm9,1  
00007FF735AD125E  vpand       xmm2,xmm1,xmm10  
00007FF735AD1263  vpsrldq     xmm0,xmm8,1  
00007FF735AD1269  vpand       xmm3,xmm0,xmm10  
00007FF735AD126E  vpackusdw   xmm0,xmm2,xmm3  
        const auto hi = brightness( r, g, b );
00007FF735AD1273  vpmulhuw    xmm3,xmm0,xmm14  
00007FF735AD1278  vpaddusw    xmm1,xmm5,xmm3  
00007FF735AD127C  vpsrlw      xmm2,xmm1,8  

        src += 4;
00007FF735AD1281  lea         rcx,[rcx+40h]  

        const auto bytes = packus_epi16( low, hi );
00007FF735AD1285  vpackuswb   xmm0,xmm6,xmm2  
    VecInteger* dest = (VecInteger*)destinationBytes;

    while( src < srcEnd )
00007FF735AD1289  lea         rax,[rcx-20h]  
        storeu_all( dest, bytes );
00007FF735AD128D  vmovdqu     xmmword ptr [rdx],xmm0  
        dest++;
00007FF735AD1291  lea         rdx,[rdx+10h]  
00007FF735AD1295  cmp         rax,r8  
00007FF735AD1298  jb          Sse::convertToGrayscale+80h (07FF735AD11C0h)

Both snippets above include only the main loop of the program. As you can see, they consist of nearly identical instructions, yet C# is twice as slow as C++.

Specifically, when tested with 511M pixels on my PC (AMD Ryzen 5 3600), the C++ code takes 221 ms and the C# code takes 410 ms.

Why?

See Why is C# twice as slow as C++ even though the generated machine code is nearly identical? for the C# source.

C++ source code: https://github.com/Const-me/IntelIntrinsics/blob/master/CppDemo/brightness.cpp https://github.com/Const-me/IntelIntrinsics/blob/master/CppDemo/brightness.inl

Stack Overflow questions must be self-contained. Please copy the relevant code into your question. — fuz, Nov 22 '19 at 23:43
How are you testing this... how have you removed the jitter from the equation in regards to your bench-marking paradigm ? — TheGeneral, Nov 22 '19 at 23:49
@TheGeneral Could you please elaborate? My CPU can’t run IL. It can only run x86. CPU doesn’t care who made these instructions, offline C++ compiler, or the JIT. — Soonts, Nov 22 '19 at 23:52
So you are saying that you are just running the instructions (there is no .NET application involved when testing)? I ask so we can rule out .NET itself and focus on the assembly only. You also need to tell us how you are testing this and how you get your benchmarking results. — TheGeneral, Nov 22 '19 at 23:54
For one thing the .NET version is misaligned. Probably not a cause for a 2x slowdown however. — Jester, Nov 23 '19 at 00:03
@TheGeneral It’s involved a lot, but irrelevant for the performance. Take a look: https://gist.github.com/Const-me/ffd8e3febeaf9a8dcfc359d1848d47a7 — Soonts, Nov 23 '19 at 00:08
You should probably link the RGB -> grayscale SO question that has your C# code, and at least link the original C++ this was compiled from. Even if I wanted to profile these instructions on my own CPU, the form in the question isn't suitable for copy/paste into anything. It's hard to make sense of this much code without definitions for C functions like `brightness()`. e.g. why are there a couple of `vpsrldq` by 1 byte instructions in your C output, then `vpand`? Is the .NET output obviously less efficient in some way that you've noticed? — Peter Cordes, Nov 23 '19 at 00:13
I finally looked at GitHub. This is not how we benchmark things: `Console.WriteLine( "{0}ms", sw.Elapsed.TotalMilliseconds );` We use tools like BenchmarkDotNet, and we also make sure we are in release mode and detached from the IDE. Since you are comparing it with C++, we also need to guarantee we are comparing apples to apples. In short, you are running this once, and that's just not good enough. Yes, C++ will be faster hands-down, but I can tell you your testing methodology is flawed, unfortunately, and the comparisons are most likely skewed. — TheGeneral, Nov 23 '19 at 00:15
What you are comparing here is probably the loading of .NET dependencies, the jitter, and, to a smaller extent, the actual code. If you really want a good comparison, run this 1000 times, use a benchmarking tool, use unsafe code and pointers, make sure you are making good use of the cache, etc. — TheGeneral, Nov 23 '19 at 00:19
the instruction order seems to be a bit different, is that enough for the CPU to get more parallelism out of it in the C++ version? that said, benchmarking is hard. maybe you could put both bits of assembly into a single (compiled) test harness and run them a few times. given that they take ~300ms running them a dozen times should be enough to warm everything up — Sam Mason, Nov 23 '19 at 00:25
@TheGeneral Thanks for the tip. It indeed was the JIT overhead. When I run same code the second time it’s pretty close to the C++ version. Here’s the source: https://gist.github.com/Const-me/3a862b9994ba82aa3e9036607af93d15 — Soonts, Nov 23 '19 at 00:27
Probably more than just the JIT, as .net loads assemblies when it needs them, anyway ill just say "I told you so" :P — TheGeneral, Nov 23 '19 at 00:28
There is a great thread to this already, hopefully [this](https://stackoverflow.com/questions/5326269/is-c-sharp-really-slower-than-say-c) is useful. — Leo Guagenti, Nov 23 '19 at 00:54

score 4 · Accepted Answer · answered Nov 23 '19 at 00:47

The reason is JIT overhead. When benchmarking .NET code, you should always discard the first measurement, because it includes the time the runtime spent producing x86 code from the IL.

Here’s what the test app prints after I’ve measured 3 times instead of just 1 (for 511M pixels):

#1 391.1885 ms, #2 216.985 ms, #3 235.5549 ms

Source code: https://gist.github.com/Const-me/0f0c283a0b998aa9977550d85fa33958

That ~220 ms is pretty close to the performance of the equivalent C++ code. So the C# SIMD is not that bad after all.
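The same measurement discipline can be sketched in C++ for illustration. This is not the code from the linked gist: the helper names `measureRuns` and `bestAfterWarmup` are hypothetical, and the sketch only shows the general idea of timing a workload several times, treating the first run as warm-up, and reporting the best of the remaining runs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` `runs` times and returns the wall-clock
// duration of each run in milliseconds, in the order the runs happened.
std::vector<double> measureRuns( const std::function<void()>& work, int runs )
{
    assert( runs > 0 );
    std::vector<double> ms;
    ms.reserve( static_cast<size_t>( runs ) );
    for( int i = 0; i < runs; i++ )
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        ms.push_back( std::chrono::duration<double, std::milli>( end - start ).count() );
    }
    return ms;
}

// Hypothetical helper: ignores the first measurement (warm-up: JIT, lazy
// loading, cold caches) and returns the fastest of the remaining runs.
double bestAfterWarmup( const std::vector<double>& ms )
{
    assert( ms.size() >= 2 );
    return *std::min_element( ms.begin() + 1, ms.end() );
}
```

Applied to the three timings printed above, `bestAfterWarmup( { 391.1885, 216.985, 235.5549 } )` skips the 391 ms first run and returns 216.985, which is the figure worth comparing against the C++ result. A dedicated benchmarking tool does considerably more than this (multiple warm-up passes, statistics, outlier handling), so treat the sketch as the bare minimum rather than a replacement.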

Soonts


It should be noted for future readers that performance benchmarking is hard to get right. A proper tool should be used, one that can prewarm and run multiple passes and that excludes things like the jitter and lazily loaded assemblies (among other things), rather than relying on a bare Stopwatch with its limitations. – TheGeneral Nov 23 '19 at 00:53
Agreed. These days we usually point people at [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet). While it won't fix all your benchmarking related issues, it takes care of a lot of the common ones. – Andy Ayers Nov 23 '19 at 01:21
Besides your solved problem, you probably learnt something else: C# vs. C++ in a title on SO is a bad idea. It raises much (bad) emotions before the people are able to look onto your code. Maybe you should have put your actual issue (e.g. "different run-time for nearly identical asm") in the title and mentioned C++ vs. C# at the bottom in a footnote... ;-) But respect for daring a second attempt despite all the negative feedback on the first. – Scheff's Cat Nov 23 '19 at 07:49
@Scheff “It raises much (bad) emotions before the people are able to look onto your code” I still don’t get why so many people here view the question about their relative performance as a holy war or something. I’ve been programming in both languages for many years, often in the same project. This lets me leverage the strengths of both and work around their weaknesses. BTW, here’s why I had the question: https://stackoverflow.com/q/58881359/126995 https://github.com/dotnet/coreclr/issues/27909 – Soonts Nov 23 '19 at 14:23
_as a holy war or something_ That's easy: The C++ guys are jealous how fast the C# guys get their stuff running. The C# guys are jealous about the fine-grained memory management in C++ (no GC which starts to clean-up when performance is urgently needed). :-) Seriously: Your multi-language approach appears reasonable to me. It's just: A vote is given quickly - no need to read carefully - even if the questioner shows a significant rep. And once, you have a vote they might sum up, especially for down-votes. We humans are gregarious animals... ;-) – Scheff's Cat Nov 23 '19 at 14:46
