I benchmarked some repetitive operations with and without the lock prefix in order to understand the performance difference. What puzzles me is this:
The results differ greatly depending on whether the operand is a byte, word, double word, or quad word (b, w, d, q). They are as follows:
On Ryzen (cycle counts are rounded):
Fetching a value into a register, adding an immediate, and storing it back takes 8 cycles for a byte or word (8 or 16 bits) and only one cycle for a double word or quad word (32 or 64 bits).
Adding an immediate to memory, or incrementing memory, takes 6 cycles for a byte or word and only one cycle for a double word or quad word (32 or 64 bits).
Doing the same with the lock prefix takes about 15 to 16 cycles in all cases.
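To make the cases above concrete, here is a minimal sketch of the operation shapes, shown for a byte and a double word. It is an illustration rather than the benchmark code itself: the function names are made up, it assumes GCC or Clang, and on targets other than x86-64 it falls back to plain C and the `__atomic_fetch_add` builtin so that it still compiles.

```c
#include <stdint.h>

/* Illustrative names only; assumes GCC or Clang (GNU inline asm, AT&T syntax). */

/* Byte: load into a register, add an immediate, store back. */
static inline void fetch_add_store8(volatile uint8_t *p) {
#if defined(__x86_64__)
    uint8_t tmp;
    __asm__ volatile("movb %1, %0\n\t"
                     "addb $1, %0\n\t"
                     "movb %0, %1"
                     : "=&q"(tmp), "+m"(*p) : : "cc");
#else
    *p = (uint8_t)(*p + 1);
#endif
}

/* Double word: load into a register, add an immediate, store back. */
static inline void fetch_add_store32(volatile uint32_t *p) {
#if defined(__x86_64__)
    uint32_t tmp;
    __asm__ volatile("movl %1, %0\n\t"
                     "addl $1, %0\n\t"
                     "movl %0, %1"
                     : "=&r"(tmp), "+m"(*p) : : "cc");
#else
    *p = *p + 1;
#endif
}

/* Double word: add an immediate directly to memory, no lock prefix. */
static inline void add_mem32(volatile uint32_t *p) {
#if defined(__x86_64__)
    __asm__ volatile("addl $1, %0" : "+m"(*p) : : "cc");
#else
    *p = *p + 1;
#endif
}

/* Double word: the same memory-destination add, with the lock prefix. */
static inline void lock_add_mem32(volatile uint32_t *p) {
#if defined(__x86_64__)
    __asm__ volatile("lock addl $1, %0" : "+m"(*p) : : "cc", "memory");
#else
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
#endif
}
```

Each function performs one increment; a benchmark would repeat the chosen operation many times on the same address and divide the elapsed cycles by the iteration count.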
On Intel (I ran the tests on my notebook, which is a bit old; you are welcome to run them yourself so we can understand exactly what is going on):
Fetching a byte, word, double word, or quad word, adding an immediate to it, and storing it back takes about 6 cycles.
Adding an immediate directly to memory, or incrementing memory, takes about 4 cycles; with the lock prefix it takes about 9 cycles.
Overall, there seems to be a huge difference between the Ryzen and the Intel CPU that I tested.
So my question is: is this also true of current Intel models?
My benchmarking code is at https://godbolt.org/z/ova9dWbsf. The map.h file, which provides the MAP macro, is pasted at the top of the code; you can simply include it instead of pasting it.
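For readers who do not want to open the link, the general measurement idea can be sketched roughly as follows. This is not the code behind the link: `time_loop_ns`, `cycles_per_iter`, `inc_plain`, and `op_fn` are hypothetical names, the timer is C11 `timespec_get` (a monotonic clock would be preferable where available), and the CPU frequency is assumed to be supplied by the caller.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical operation type: one update of a 32-bit memory location. */
typedef void (*op_fn)(volatile uint32_t *);

/* Sample operation in plain C: load, add 1, store back. */
static void inc_plain(volatile uint32_t *p) { *p = *p + 1; }

/* Run `op` on the same address `iterations` times and return the
 * elapsed wall-clock time in nanoseconds. */
static double time_loop_ns(op_fn op, volatile uint32_t *target,
                           uint64_t iterations) {
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (uint64_t i = 0; i < iterations; i++)
        op(target);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
           (double)(t1.tv_nsec - t0.tv_nsec);
}

/* Convert an elapsed time into cycles per iteration, given the core
 * frequency in Hz (assumed to be known to the caller). */
static double cycles_per_iter(double elapsed_ns, uint64_t iterations,
                              double cpu_hz) {
    return elapsed_ns * 1e-9 * cpu_hz / (double)iterations;
}
```

Calling through a function pointer adds its own overhead to every iteration, so a real measurement would place the instruction under test directly inside the loop.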
The results of my benchmarks are below:
AMD Ryzen 5 PRO 4650G with Radeon Graphics
cpuFrequency()3.69321e+09
Bench: ff_fetch_add1immediate_store_indirect_al al time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_ax ax time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_eax eax time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rax rax time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_dl dl time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_dx dx time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_edx edx time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rdx rdx time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_edi edi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rdi rdi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_esi esi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rsi rsi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r8 r8 time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r9 r9 time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r14 r14 time: 0.9
Bench: ff_add_immediate_indirect_b addb " $1,(%r15)" time: 6.0
Bench: ff_lock_add_immediate_indirect_b lock addb " $1,(%r15)" time: 15.2
Bench: ff_add_immediate_indirect_w addw " $1,(%r15)" time: 6.0
Bench: ff_lock_add_immediate_indirect_w lock addw " $1,(%r15)" time: 15.4
Bench: ff_add_immediate_indirect_l addl " $1,(%r15)" time: 0.9
Bench: ff_lock_add_immediate_indirect_l lock addl " $1,(%r15)" time: 15.4
Bench: ff_add_immediate_indirect_q addq " $1,(%r15)" time: 0.9
Bench: ff_lock_add_immediate_indirect_q lock addq " $1,(%r15)" time: 15.4
Bench: ff_inc_indirect_b incb " (%r15)" time: 6.1
Bench: ff_lock_inc_indirect_b lock incb "(%r15)" time: 15.3
Bench: ff_inc_indirect_w incw " (%r15)" time: 6.1
Bench: ff_lock_inc_indirect_w lock incw "(%r15)" time: 15.3
Bench: ff_inc_indirect_l incl " (%r15)" time: 0.9
Bench: ff_lock_inc_indirect_l lock incl "(%r15)" time: 15.3
Bench: ff_inc_indirect_q incq " (%r15)" time: 0.9
Bench: ff_lock_inc_indirect_q lock incq "(%r15)" time: 15.5
Intel(R) Pentium(R) CPU N3540 @ 2.16GHz
cpuFrequency()2.16747e+09
Bench: ff_fetch_add1immediate_store_indirect_al al time: 6.9
Bench: ff_fetch_add1immediate_store_indirect_ax ax time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_eax eax time: 5.6
Bench: ff_fetch_add1immediate_store_indirect_rax rax time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_cl cl time: 6.0
Bench: ff_fetch_add1immediate_store_indirect_cx cx time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_ecx ecx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rcx rcx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_dl dl time: 6.0
Bench: ff_fetch_add1immediate_store_indirect_dx dx time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_edx edx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rdx rdx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_edi edi time: 5.6
Bench: ff_fetch_add1immediate_store_indirect_rdi rdi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_esi esi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rsi rsi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_r8 r8 time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_r9 r9 time: 5.5
Bench: ff_add_immediate_indirect_b time: 4.3
Bench: ff_lock_add_immediate_indirect_b time: 8.6
Bench: ff_add_immediate_indirect_w time: 4.4
Bench: ff_lock_add_immediate_indirect_w time: 8.6
Bench: ff_add_immediate_indirect_l time: 4.3
Bench: ff_lock_add_immediate_indirect_l time: 8.6
Bench: ff_add_immediate_indirect_q time: 4.2
Bench: ff_lock_add_immediate_indirect_q time: 8.6
Bench: ff_inc_indirect_b time: 4.5
Bench: ff_lock_inc_indirect_b time: 9.4
Bench: ff_inc_indirect_w time: 4.5
Bench: ff_lock_inc_indirect_w time: 9.4
Bench: ff_inc_indirect_l time: 4.5
Bench: ff_lock_inc_indirect_l time: 9.5
Bench: ff_inc_indirect_q time: 4.5
Bench: ff_lock_inc_indirect_q time: 9.4