20x performance difference Interlocked.Read vs Interlocked.CompareExchange though both implemented with lock cmpxchg

Question

Question

I wanted to write a small profiler class that allows me to measure the run time of hot paths throughout the application. In doing so, I discovered and interesting performance difference between two possible implementations that I cannot explain, but would like to understand.

Setup

The idea is as follows:

// somewhere accessible
public static profiler HotPathProfiler = new HotPathProfiler("some name", enabled: true);

// within programm
long ticket = profiler.Enter();
... // hot path
var result = profiler.Exit(ticket: ticket);

As there aren't much of these hot paths running in parallel, the idea is to implement this via an array that holds the timestamp (0 when slot is free) and returning the index (called ticket) when calling Enter(). So the class looks like the following:

public class HotPathProfiler
{
    private readonly string _name;
    private readonly bool _enabled;
    private readonly long[] _ticketList;

    public HotPathProfiler(string name, bool enabled)
    {
        _name = name;
        _enabled = enabled;
        _ticketList = new long[128];
    }
}

If code Enter()s and none of the 128 tickets is available, -1 will be returned which the Exit(ticket) function can handle by returning early.

When thinking about how to implement the Enter() call I saw the Interlocked.Read method that can atomically read values on 32bit systems, while, according to the documentation, it is unnecessary on 64bit systems.

So I went on an implemented various types of Enter() methods, including one with Interlocked.Read and one with Interlocked.CompareExchange, and compared them with BenchmarkDotNet. That's where I discovered an enormous performance difference:

|       Method |      Mean |    Error |   StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
|    SafeArray |  28.64 ns | 0.573 ns | 0.536 ns |     295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns |     248 B |

The benchmark for both look pretty much the same:

        [Benchmark]
        public void SafeArray()
        {
            // doesn't matter if 'i < 1' or 'i < 10'
            // performance differs by the same factor (approx. 20x)
            for (int i = 0; i < 1; i++)
            {
                _ticketArr[i] = _hpp_sa.EnterSafe(); 

                // SafeArrayCAS:
                // _ticketArr[i] = _hpp_sa_cas.EnterSafe(); 
            }
        }

Implementations

Again, free slots hold value 0, occupied slots some other value (timestamp). Enter() is supposed to return the index/ticket of the slot.

SafeArrayCAS (slow)

        public long EnterSafe()
        {
            if (!_enabled)
            {
                return -1;
            }
            long last = 0;
            long ts = Stopwatch.GetTimestamp();
            long val;
            do
            {
                val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
                last++;
            } while (val != 0 && last < 128);
            return val == 0 ? last : -1;
        }

SafeArray (fast)

        public long EnterSafe()
        {
            if (!_enabled)
            {
                return -1;
            }
            long last = 0;
            long val;
            do
            {
                val = Interlocked.Read(ref _ticketList[last]);
                last++;
            } while (val != 0 && last < 128);
            if (val != 0)
            {
                return -1;
            }
            long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
            if (prev != 0)
            {
                return -1;
            }
            return last;
        }

Enter rabbit hole

Now, one would say that it's no surprise to see a difference, since the slow method always tries to CAS an entry, while the other one only lazily reads each entry and then only tries a CAS once.

But, beside the fact that the benchmark only does 1 Enter(), i.e. only one while{} run that shouldn't make that (20x) much difference, it is even harder to explain once you realize the atomic read is implemented as CAS:

SafeArrayCAS (slow)

        public long EnterSafe()
        {
            if (!_enabled)
[...] // ommited for brevity 
            {
                return -1;
[...] // ommited for brevity   
            }
            long last = 0;
00007FF82D048FCE  xor         edi,edi  
            long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0  lea         rcx,[rsp+28h]  
00007FF82D048FD5  call        CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)   
00007FF82D048FDA  mov         rsi,qword ptr [rsp+28h]  
00007FF82D048FDF  mov         rax,7FF88CF3E07Ch  
00007FF82D048FE9  cmp         dword ptr [rax],0  
00007FF82D048FEC  jne         HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)  
            long val;
            do
            {
                val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE  mov         rbx,qword ptr [rsp+50h]  
00007FF82D048FF3  mov         rax,qword ptr [rbx+10h]  
00007FF82D048FF7  mov         edx,dword ptr [rax+8]  
00007FF82D048FFA  movsxd      rdx,edx  
00007FF82D048FFD  cmp         rdi,rdx  
00007FF82D049000  jae         HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)  
00007FF82D049002  lea         rdx,[rax+rdi*8+10h]  
00007FF82D049007  xor         eax,eax  
00007FF82D049009  lock cmpxchg qword ptr [rdx],rsi  
                last++;
00007FF82D04900E  inc         rdi  
            } while (val != 0 && last < 128);
00007FF82D049011  test        rax,rax  
00007FF82D049014  je          HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)  
00007FF82D049016  cmp         rdi,80h  
00007FF82D04901D  mov         qword ptr [rsp+50h],rbx  
00007FF82D049022  jl          HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)

SafeArray (fast)

        public long EnterSafe()
        {
            if (!_enabled)
[...] // ommited for brevity 
            {
                return -1;
[...] // ommited for brevity
            }
            long last = 0;
00007FF82D046C74  xor         esi,esi  
            long val;
            do
            {
                val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76  mov         rax,qword ptr [rcx+10h]  
00007FF82D046C7A  mov         edx,dword ptr [rax+8]  
00007FF82D046C7D  movsxd      rdx,edx  
00007FF82D046C80  cmp         rsi,rdx  
00007FF82D046C83  jae         HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)  
00007FF82D046C89  lea         rdx,[rax+rsi*8+10h]  
00007FF82D046C8E  xor         r8d,r8d  
00007FF82D046C91  xor         eax,eax  
00007FF82D046C93  lock cmpxchg qword ptr [rdx],r8  
                last++;
00007FF82D046C98  inc         rsi  
            } while (val != 0 && last < 128);
00007FF82D046C9B  test        rax,rax  
00007FF82D046C9E  je          HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)  
00007FF82D046CA0  cmp         rsi,80h  
00007FF82D046CA7  jl          HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)  
            if (val != 0)
[...] // ommited for brevity   
            {
                return -1;
[...] // ommited for brevity  
            }
            long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF  mov         rcx,qword ptr [rcx+10h]  
00007FF82FBA6AE3  mov         eax,dword ptr [rcx+8]  
00007FF82FBA6AE6  movsxd      rax,eax  
00007FF82FBA6AE9  cmp         rsi,rax  
00007FF82FBA6AEC  jae         HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)  
00007FF82FBA6AEE  lea         rdi,[rcx+rsi*8+10h]  
00007FF82FBA6AF3  mov         qword ptr [rsp+28h],rdi  
00007FF82FBA6AF8  lea         rcx,[rsp+30h]  
00007FF82FBA6AFD  call        CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)  
00007FF82FBA6B02  mov         rdx,qword ptr [rsp+30h]  
00007FF82FBA6B07  xor         eax,eax  
00007FF82FBA6B09  mov         rdi,qword ptr [rsp+28h]  
00007FF82FBA6B0E  lock cmpxchg qword ptr [rdi],rdx  
00007FF82FBA6B13  mov         rdi,rax  
00007FF82FBA6B16  mov         rax,7FF88CF3E07Ch  
00007FF82FBA6B20  cmp         dword ptr [rax],0  
00007FF82FBA6B23  jne         HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)  
            if (prev != 0)
[...] // ommited for brevity

Summary

I run all on Win10 x64 Release build, on a Xeon E-2176G (6-core Coffee Lake) CPU. Assembler output is from Visual Studio, but equals the DisassemblyDiagnoser of BenchmarkDotNet.

Beside the hows and whys on why I'm doing this at all, I simply cannot explain the performance difference between these two methods. It shouldn't be this much I would guess. Can it be BenchmarkDotNet itself? Am I'm missing something else?

Feels like I have a black spot in my understanding about this lowlevel stuff that I'd like to shine some light on...thanks!

PS:

What I've tried so far:

Rearraging order of Benchmark runs
Defer GetTimestamp() call in the slow method
Doing some initialization/test calls before the benchmark run (though I guess that's covered anyways by BenchmarkDotNet)

`Interlocked.Read` may imply a full barrier (against runtime reordering); that could explain using `lock cmpxchg` as a read (the same way 32-bit code would use `lock cmpxchg8b`). Doing an atomic RMW on the memory location seems like an inefficient implementation vs. `mov`-load + `mfence` or something, unless a failed cmpxchg avoids dirtying the cache line and thus lets multiple threads succeed at once. [Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?](https://stackoverflow.com/q/63008857) doesn't mention Intel doing that, but it's possible. — Peter Cordes, May 03 '21 at 11:31
(Of course, even in 32-bit mode, the smart way to atomically read 8 bytes is with MMX or SSE2 `movq xmm0, [mem]`, or at worst `fild`, like GCC -m32 does for std::atomic; that doesn't require a memory barrier) — Peter Cordes, May 03 '21 at 11:31
What CPU hardware did you try this on? e.g. i7-6700k (Skylake)? Or some AMD Zen model? Anyway, a simpler test case for the possible effect here (failed `lock cmpxchg` maybe not dirtying the line) might be to have multiple threads just spin on `Interlocked.Read()` on the same single object, and see if adding more threads can just run one separate cores without disturbing each other. — Peter Cordes, May 03 '21 at 11:37
@PeterCordes that's a very good point you brought up. Since the `cmpxchg-as-read` is implemented as `write-0-expect-0`, the CPU may internally detect the read case and actually executes something similar to `mov`+`mfence`. I tried to find the `Interlocked.Read` implementation in the runtime for enlightening comments, but failed so far. As for the CPU: It's and Intel Xeon E-2176G. The verification method with the simpler test case is great input as well, thanks! — grasbueschel, May 03 '21 at 11:44
As far as I can tell, your examples are not equivalent. The 'fast' version may return -1 even if there are tickets available. While the 'slow' should guarantee a returned value if there is one available. At least assuming tickets are never released. — JonasH, May 03 '21 at 13:01
@JonasH yes, you're right in that they don't do exactly the same - that was intentional btw. Yet in the benchmark case, with only one thread running, that difference shouldn't matter. — grasbueschel, May 03 '21 at 15:41

20x performance difference Interlocked.Read vs Interlocked.CompareExchange though both implemented with lock cmpxchg

Question

Setup

Implementations

SafeArrayCAS (slow)

SafeArray (fast)

Enter rabbit hole

SafeArrayCAS (slow)

SafeArray (fast)

Summary

0 Answers0