Question
I wanted to write a small profiler class that allows me to measure the run time of hot paths throughout the application. In doing so, I discovered and interesting performance difference between two possible implementations that I cannot explain, but would like to understand.
Setup
The idea is as follows:
// somewhere accessible
public static profiler HotPathProfiler = new HotPathProfiler("some name", enabled: true);
// within programm
long ticket = profiler.Enter();
... // hot path
var result = profiler.Exit(ticket: ticket);
As there aren't much of these hot paths running in parallel, the idea is to implement this via an array that holds the timestamp (0 when slot is free) and returning the index (called ticket) when calling Enter()
. So the class looks like the following:
public class HotPathProfiler
{
private readonly string _name;
private readonly bool _enabled;
private readonly long[] _ticketList;
public HotPathProfiler(string name, bool enabled)
{
_name = name;
_enabled = enabled;
_ticketList = new long[128];
}
}
If code Enter()
s and none of the 128 tickets is available, -1
will be returned which the Exit(ticket)
function can handle by returning early.
When thinking about how to implement the Enter()
call I saw the Interlocked.Read method that can atomically read values on 32bit systems, while, according to the documentation, it is unnecessary on 64bit systems.
So I went on an implemented various types of Enter()
methods, including one with Interlocked.Read
and one with Interlocked.CompareExchange
, and compared them with BenchmarkDotNet. That's where I discovered an enormous performance difference:
| Method | Mean | Error | StdDev | Code Size |
|------------- |----------:|---------:|---------:|----------:|
| SafeArray | 28.64 ns | 0.573 ns | 0.536 ns | 295 B |
| SafeArrayCAS | 744.75 ns | 8.741 ns | 7.749 ns | 248 B |
The benchmark for both look pretty much the same:
[Benchmark]
public void SafeArray()
{
// doesn't matter if 'i < 1' or 'i < 10'
// performance differs by the same factor (approx. 20x)
for (int i = 0; i < 1; i++)
{
_ticketArr[i] = _hpp_sa.EnterSafe();
// SafeArrayCAS:
// _ticketArr[i] = _hpp_sa_cas.EnterSafe();
}
}
Implementations
Again, free slots hold value 0
, occupied slots some other value (timestamp). Enter()
is supposed to return the index/ticket of the slot.
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long ts = Stopwatch.GetTimestamp();
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
last++;
} while (val != 0 && last < 128);
return val == 0 ? last : -1;
}
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
{
return -1;
}
long last = 0;
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
last++;
} while (val != 0 && last < 128);
if (val != 0)
{
return -1;
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
if (prev != 0)
{
return -1;
}
return last;
}
Enter rabbit hole
Now, one would say that it's no surprise to see a difference, since the slow method always tries to CAS an entry, while the other one only lazily reads each entry and then only tries a CAS once.
But, beside the fact that the benchmark only does 1 Enter()
, i.e. only one while{}
run that shouldn't make that (20x) much difference, it is even harder to explain once you realize the atomic read is implemented as CAS:
SafeArrayCAS (slow)
public long EnterSafe()
{
if (!_enabled)
[...] // ommited for brevity
{
return -1;
[...] // ommited for brevity
}
long last = 0;
00007FF82D048FCE xor edi,edi
long ts = Stopwatch.GetTimestamp();
00007FF82D048FD0 lea rcx,[rsp+28h]
00007FF82D048FD5 call CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)
00007FF82D048FDA mov rsi,qword ptr [rsp+28h]
00007FF82D048FDF mov rax,7FF88CF3E07Ch
00007FF82D048FE9 cmp dword ptr [rax],0
00007FF82D048FEC jne HotPathProfilerSafeArrayCAS.EnterSafe()+0A6h (07FF82D049046h)
long val;
do
{
val = Interlocked.CompareExchange(ref _ticketList[last], ts, 0);
00007FF82D048FEE mov rbx,qword ptr [rsp+50h]
00007FF82D048FF3 mov rax,qword ptr [rbx+10h]
00007FF82D048FF7 mov edx,dword ptr [rax+8]
00007FF82D048FFA movsxd rdx,edx
00007FF82D048FFD cmp rdi,rdx
00007FF82D049000 jae HotPathProfilerSafeArrayCAS.EnterSafe()+0ADh (07FF82D04904Dh)
00007FF82D049002 lea rdx,[rax+rdi*8+10h]
00007FF82D049007 xor eax,eax
00007FF82D049009 lock cmpxchg qword ptr [rdx],rsi
last++;
00007FF82D04900E inc rdi
} while (val != 0 && last < 128);
00007FF82D049011 test rax,rax
00007FF82D049014 je HotPathProfilerSafeArrayCAS.EnterSafe()+084h (07FF82D049024h)
00007FF82D049016 cmp rdi,80h
00007FF82D04901D mov qword ptr [rsp+50h],rbx
00007FF82D049022 jl HotPathProfilerSafeArrayCAS.EnterSafe()+04Eh (07FF82D048FEEh)
SafeArray (fast)
public long EnterSafe()
{
if (!_enabled)
[...] // ommited for brevity
{
return -1;
[...] // ommited for brevity
}
long last = 0;
00007FF82D046C74 xor esi,esi
long val;
do
{
val = Interlocked.Read(ref _ticketList[last]);
00007FF82D046C76 mov rax,qword ptr [rcx+10h]
00007FF82D046C7A mov edx,dword ptr [rax+8]
00007FF82D046C7D movsxd rdx,edx
00007FF82D046C80 cmp rsi,rdx
00007FF82D046C83 jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82D046D2Ch)
00007FF82D046C89 lea rdx,[rax+rsi*8+10h]
00007FF82D046C8E xor r8d,r8d
00007FF82D046C91 xor eax,eax
00007FF82D046C93 lock cmpxchg qword ptr [rdx],r8
last++;
00007FF82D046C98 inc rsi
} while (val != 0 && last < 128);
00007FF82D046C9B test rax,rax
00007FF82D046C9E je HotPathProfilerSafeArray.EnterSafe()+059h (07FF82D046CA9h)
00007FF82D046CA0 cmp rsi,80h
00007FF82D046CA7 jl HotPathProfilerSafeArray.EnterSafe()+026h (07FF82D046C76h)
if (val != 0)
[...] // ommited for brevity
{
return -1;
[...] // ommited for brevity
}
long prev = Interlocked.CompareExchange(ref _ticketList[last], Stopwatch.GetTimestamp(), 0);
00007FF82FBA6ADF mov rcx,qword ptr [rcx+10h]
00007FF82FBA6AE3 mov eax,dword ptr [rcx+8]
00007FF82FBA6AE6 movsxd rax,eax
00007FF82FBA6AE9 cmp rsi,rax
00007FF82FBA6AEC jae HotPathProfilerSafeArray.EnterSafe()+0DCh (07FF82FBA6B4Ch)
00007FF82FBA6AEE lea rdi,[rcx+rsi*8+10h]
00007FF82FBA6AF3 mov qword ptr [rsp+28h],rdi
00007FF82FBA6AF8 lea rcx,[rsp+30h]
00007FF82FBA6AFD call CLRStub[JumpStub]@7ff82d076d70 (07FF82D076D70h)
00007FF82FBA6B02 mov rdx,qword ptr [rsp+30h]
00007FF82FBA6B07 xor eax,eax
00007FF82FBA6B09 mov rdi,qword ptr [rsp+28h]
00007FF82FBA6B0E lock cmpxchg qword ptr [rdi],rdx
00007FF82FBA6B13 mov rdi,rax
00007FF82FBA6B16 mov rax,7FF88CF3E07Ch
00007FF82FBA6B20 cmp dword ptr [rax],0
00007FF82FBA6B23 jne HotPathProfilerSafeArray.EnterSafe()+0D5h (07FF82FBA6B45h)
if (prev != 0)
[...] // ommited for brevity
Summary
I run all on Win10 x64 Release build, on a Xeon E-2176G (6-core Coffee Lake) CPU. Assembler output is from Visual Studio, but equals the DisassemblyDiagnoser of BenchmarkDotNet.
Beside the hows and whys on why I'm doing this at all, I simply cannot explain the performance difference between these two methods. It shouldn't be this much I would guess. Can it be BenchmarkDotNet itself? Am I'm missing something else?
Feels like I have a black spot in my understanding about this lowlevel stuff that I'd like to shine some light on...thanks!
PS:
What I've tried so far:
- Rearraging order of Benchmark runs
- Defer
GetTimestamp()
call in the slow method - Doing some initialization/test calls before the benchmark run (though I guess that's covered anyways by BenchmarkDotNet)