I created a simple benchmark out of curiosity, but cannot explain the results.

As benchmark data, I prepared an array of structs with some random values. The preparation phase is not benchmarked:

struct Val 
{
    public float val;
    public float min;
    public float max;
    public float padding;
}

const int iterations = 1000;
Val[] values = new Val[iterations];
// fill the array with randoms

Basically, I wanted to compare these two clamp implementations:

static class Clamps
{
    public static float ClampSimple(float val, float min, float max)
    {
        if (val < min) return min;          
        if (val > max) return max;
        return val;
    }

    public static T ClampExt<T>(this T val, T min, T max) where T : IComparable<T>
    {
        if (val.CompareTo(min) < 0) return min;
        if (val.CompareTo(max) > 0) return max;
        return val;
    }
}

Here are my benchmark methods:

[Benchmark]
public float Extension()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += v.val.ClampExt(v.min, v.max);
    }

    return result;
}

[Benchmark]
public float Direct()
{
    float result = 0;
    for (int i = 0; i < iterations; ++i)
    {
        ref Val v = ref values[i];
        result += Clamps.ClampSimple(v.val, v.min, v.max);
    }

    return result;
}

I'm using BenchmarkDotNet version 0.10.12 with two jobs:

[MonoJob]
[RyuJitX64Job]

And these are the results I get:

BenchmarkDotNet=v0.10.12, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7-6920HQ CPU 2.90GHz (Skylake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2836123 Hz, Resolution=352.5940 ns, Timer=TSC
  [Host]    : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0
  Mono      : Mono 5.12.0 (Visual Studio), 64bit
  RyuJitX64 : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3062.0


    Method |       Job | Runtime |      Mean |     Error |    StdDev |
---------- |---------- |-------- |----------:|----------:|----------:|
 Extension |      Mono |    Mono | 10.860 us | 0.0063 us | 0.0053 us |
    Direct |      Mono |    Mono | 11.211 us | 0.0074 us | 0.0062 us |
 Extension | RyuJitX64 |     Clr |  5.711 us | 0.0014 us | 0.0012 us |
    Direct | RyuJitX64 |     Clr |  1.395 us | 0.0056 us | 0.0052 us |

I can accept that Mono is somewhat slower here in general. But what I don't understand is:

Why does Mono run the `Direct` method slower than `Extension`, even though `Direct` uses a very simple comparison while `Extension` goes through additional method calls?

RyuJIT, on the other hand, shows a 4x advantage for the simple method.

Can anyone explain this?

dymanoid
  • Unless you are able to supply us with the generated assembly, it is very hard to guess why the performance is what it is. I would actually expect this code to be dominated by array bounds checking, memory copying, cache misses etc. rather than the actual user code you have shown. Also, how many benchmarks did you try? Did you try with higher iteration counts? What were the results? What you have shown is ~3% performance difference on microseconds level which seems to me more as equivalent than anything else. – Zdeněk Jelínek Oct 19 '18 at 10:34
  • @ZdeněkJelínek, yes, I tried with different iteration counts (100, 1,000, 10,000, 100,000). The benchmarks created with `BenchmarkDotNet` are already clever enough (warm-up phase, multiple iterations, etc.), so I just trust them. My question is, however, not about the 3% (BTW, what do you mean by 3%?), but about the difference between Mono's and RyuJIT's performance: Mono runs the `Extension` and `Direct` tests about equally fast, while RyuJIT runs the `Direct` benchmark 4x faster than `Extension`. You don't need the assemblies; just generate them yourself using BenchmarkDotNet and the code I provided. – dymanoid Oct 22 '18 at 11:41
  • I wonder how the code would be efficient under .NET Core 2.1 runtime. Would be any difference with RyuJitX64 : .NET Framework 4.7? – Grzesiek Danowski Oct 26 '18 at 15:20

1 Answer


Since nobody wanted to do some disassembly stuff, I answer my own question.

It seems that the reason is the native code generated by the JITs, not the array bounds checking or caching issues mentioned in the comments.

RyuJIT generates very efficient code for the `ClampSimple` method:

    vucomiss xmm1,xmm0
    jbe     M01_L00
    vmovaps xmm0,xmm1
    ret

M01_L00:
    vucomiss xmm0,xmm2
    jbe     M01_L01
    vmovaps xmm0,xmm2
    ret

M01_L01:
    ret

It uses the CPU's native `ucomiss` instruction to compare the floats and fast `movaps` instructions to move them between registers.

The extension method is slower because it makes a couple of calls to `System.Single.CompareTo(System.Single)`; here's the first branch:

lea     rcx,[rsp+30h]
vmovss  dword ptr [rsp+38h],xmm1
call    mscorlib_ni+0xda98f0
test    eax,eax
jge     M01_L00
vmovss  xmm0,dword ptr [rsp+38h]
add     rsp,28h
ret

Let's have a look at the native code Mono produces for the ClampSimple method:

    cvtss2sd    xmm0,xmm0  
    movss       xmm1,dword ptr [rsp+8]  
    cvtss2sd    xmm1,xmm1  
    comisd      xmm1,xmm0  
    jbe         M01_L00  
    movss       xmm0,dword ptr [rsp+8]  
    cvtss2sd    xmm0,xmm0  
    cvtsd2ss    xmm0,xmm0  
    jmp         M01_L01 

M01_L00: 
    movss       xmm0,dword ptr [rsp]  
    cvtss2sd    xmm0,xmm0  
    movss       xmm1,dword ptr [rsp+10h]  
    cvtss2sd    xmm1,xmm1  
    comisd      xmm1,xmm0  
    jp          M01_L02
    jae         M01_L02  
    movss       xmm0,dword ptr [rsp+10h]  
    cvtss2sd    xmm0,xmm0  
    cvtsd2ss    xmm0,xmm0  
    jmp         M01_L01

M01_L02:
    movss       xmm0,dword ptr [rsp]  
    cvtss2sd    xmm0,xmm0  
    cvtsd2ss    xmm0,xmm0  

M01_L01:
    add         rsp,18h  
    ret 

Mono's code converts the floats to doubles and compares them using `comisd`. Furthermore, there are strange "conversion flips" (float → double → float) when preparing the return value, and there is much more data movement between memory and registers. This explains why Mono's code for the simple method is slower than RyuJIT's.

Mono's code for the `Extension` method is very similar to RyuJIT's, but again with the strange float → double → float conversion flips:

movss       xmm0,dword ptr [rbp-10h]  
cvtss2sd    xmm0,xmm0  
movsd       xmm1,xmm0  
cvtsd2ss    xmm1,xmm1  
lea         rbp,[rbp]  
mov         r11,2061520h  
call        r11  
test        eax,eax  
jge         M0_L0 
movss       xmm0,dword ptr [rbp-10h]  
cvtss2sd    xmm0,xmm0  
cvtsd2ss    xmm0,xmm0
ret

It seems that RyuJIT can generate more efficient code for handling floats. Mono treats floats as doubles and converts the values each time, which also causes additional value transfers between CPU registers and memory.

Note that all this is valid for Windows x64 only. I don't know how this benchmark will perform on Linux or Mac.

dymanoid
  • *Since nobody wanted to do some disassembly stuff* is a bit of a weird statement. You chose to include only source in your question, not asm, so I couldn't have looked at the efficiency of the asm for you. I'm not aware of an online compiler-explorer equivalent for C# like https://godbolt.org/ has for C, C++, Rust, and some other languages, and I don't have Windows. Some other x86 performance experts that look at SO questions are in the same boat, I think. – Peter Cordes Oct 26 '18 at 15:25
  • @PeterCordes, I asked this question a couple of months ago and had no idea that the reason was the generated native code. At first I thought it could be explained by .NET runtime specifics. So it's a bit weird to blame me for not knowing I should have posted the disassembled native code right away. – dymanoid Oct 26 '18 at 15:29
  • Anyway, wow, that's terrible asm from Mono. If you could write your source in a way that got RyuJIT to use branchless `maxss` / `minss` for clamping, it might be even faster. (Or slower if the value is almost always in-bounds so the branch predicts well and latency is the bottleneck. Or, especially, almost always *out* of bounds, which would make branching decouple the result from the input, breaking the data dependency chain.) – Peter Cordes Oct 26 '18 at 15:34
  • The answer to any performance question for a tiny loop is going to come down to looking at the asm to figure out where the instructions in the loop came from. If you find that the benchmark scales linearly with the iteration count, you know the time is being spent in JITed native code, not in startup overhead or something. (which is what you want). TL:DR: you pretty much always need to post the asm for anyone to answer why one compiler is faster than another for a single loop. – Peter Cordes Oct 26 '18 at 15:37
  • @PeterCordes, thanks for the feedback, but I doubt we in the .NET world have any means to affect what the JIT chooses when jitting a managed method. But I was really surprised by the terrible performance of that simple method in Mono and wanted to get an answer as to why that is. – dymanoid Oct 26 '18 at 15:38
  • Right, I mean you'd write the source differently to hand-hold the JIT compiler. Like maybe `val = (val>min) ? val : min;` instead of an early `return`, to encourage branchless code. See [What is the instruction that gives branchless FP min and max on x86?](https://stackoverflow.com/q/40196817) – Peter Cordes Oct 26 '18 at 15:39