
I benchmarked some repetitive memory operations with and without the `lock` prefix in order to understand the performance differences. What puzzles me is the following:

Working with byte, word, double word, and quad word operands (b, w, d, q), the results differ a lot, as follows:

Using Ryzen (clock cycle counts are rounded):

Fetching a value into a register, adding an immediate to it, and storing it back takes 8 clocks for byte or word operands (8/16 bits) and ONLY ONE CYCLE for double word and quad word operands (32/64 bits).

Adding an immediate to memory, or incrementing memory in place, takes 6 cycles for byte or word operands and ONLY ONE CYCLE for double word and quad word operands (32/64 bits).

Doing the same but with a `lock` prefix takes 16 cycles in all cases.
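To make the three cases concrete, these are the kinds of instruction sequences being compared, shown here for the byte case in AT&T syntax (a sketch that mirrors the mnemonics in the results below, not the exact loop bodies from the linked code):

```asm
# 1) fetch / add immediate / store back through a register
movb  (%r15), %al        # load the byte from memory
addb  $1, %al            # add the immediate in the register
movb  %al, (%r15)        # store it back

# 2) read-modify-write directly on memory
addb  $1, (%r15)         # or: incb (%r15)

# 3) the same read-modify-write, made atomic
lock addb $1, (%r15)     # or: lock incb (%r15)
```

The other operand sizes just swap the b suffix and register for w/`%ax`, l/`%eax`, q/`%rax`.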

Using Intel (I did the tests on my notebook, which is a bit old, but you may run your own tests so we can understand exactly what is going on):

Fetching a byte, word, double word, or quad word, adding an immediate to it, and storing it back takes 6 cycles.

Adding an immediate directly to memory, or incrementing memory, takes 4 cycles; with the `lock` prefix it takes 8 cycles.

Overall, there seems to be a huge difference between the Ryzen and the Intel CPU that I checked.

So the question is: does this also hold for current Intel models?

My benchmarking code is at https://godbolt.org/z/ova9dWbsf. The map.h file that provides the MAP macro is copy-pasted at the beginning of the code; you may simply include it instead of pasting it.
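In case the link goes stale, here is a minimal, self-contained sketch of how such a measurement can be structured (this is not the code from the link; the function name and iteration counts are made up for illustration). It times a long chain of dependent read-modify-writes to the same address with `rdtsc`, so it measures latency rather than throughput:

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc (GCC/Clang)

// Hypothetical helper: average reference-TSC cycles per locked add to memory.
static double time_lock_addl(std::uint32_t *p, std::uint64_t iters) {
    std::uint64_t start = __rdtsc();
    for (std::uint64_t i = 0; i < iters; ++i) {
        // Each iteration depends on the previous store to the same address,
        // so the loop is bound by the memory RMW latency.
        asm volatile("lock addl $1, (%0)" : : "r"(p) : "memory", "cc");
    }
    std::uint64_t stop = __rdtsc();
    return double(stop - start) / double(iters);
}

int main() {
    std::uint32_t counter = 0;
    time_lock_addl(&counter, 1'000'000);           // warm-up
    std::printf("lock addl: %.1f ref cycles/iter\n",
                time_lock_addl(&counter, 100'000'000));
}
```

The non-locked and fetch/add/store variants follow the same pattern with a different asm body. Note that `rdtsc` counts reference cycles, so the result still has to be scaled by the ratio of core clock to TSC frequency (my numbers above are already converted to core clocks via cpuFrequency()).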

The results of my benchmarks are below:

AMD Ryzen 5 PRO 4650G with Radeon Graphics

cpuFrequency()3.69321e+09
Bench: ff_fetch_add1immediate_store_indirect_al al time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_ax ax time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_eax eax time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rax rax time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_dl dl time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_dx dx time: 7.8
Bench: ff_fetch_add1immediate_store_indirect_edx edx time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rdx rdx time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_edi edi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rdi rdi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_esi esi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_rsi rsi time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r8 r8 time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r9 r9 time: 0.9
Bench: ff_fetch_add1immediate_store_indirect_r14 r14 time: 0.9
Bench: ff_add_immediate_indirect_b addb " $1,(%r15)" time: 6.0
Bench: ff_lock_add_immediate_indirect_b lock addb " $1,(%r15)" time: 15.2
Bench: ff_add_immediate_indirect_w addw " $1,(%r15)" time: 6.0
Bench: ff_lock_add_immediate_indirect_w lock addw " $1,(%r15)" time: 15.4
Bench: ff_add_immediate_indirect_l addl " $1,(%r15)" time: 0.9
Bench: ff_lock_add_immediate_indirect_l lock addl " $1,(%r15)" time: 15.4
Bench: ff_add_immediate_indirect_q addq " $1,(%r15)" time: 0.9
Bench: ff_lock_add_immediate_indirect_q lock addq " $1,(%r15)" time: 15.4
Bench: ff_inc_indirect_b incb " (%r15)" time: 6.1
Bench: ff_lock_inc_indirect_b lock incb "(%r15)" time: 15.3
Bench: ff_inc_indirect_w incw " (%r15)" time: 6.1
Bench: ff_lock_inc_indirect_w lock incw "(%r15)" time: 15.3
Bench: ff_inc_indirect_l incl " (%r15)" time: 0.9
Bench: ff_lock_inc_indirect_l lock incl "(%r15)" time: 15.3
Bench: ff_inc_indirect_q incq " (%r15)" time: 0.9
Bench: ff_lock_inc_indirect_q lock incq "(%r15)" time: 15.5

Intel(R) Pentium(R) CPU N3540 @ 2.16GHz

cpuFrequency()2.16747e+09
Bench: ff_fetch_add1immediate_store_indirect_al al time: 6.9
Bench: ff_fetch_add1immediate_store_indirect_ax ax time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_eax eax time: 5.6
Bench: ff_fetch_add1immediate_store_indirect_rax rax time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_cl cl time: 6.0
Bench: ff_fetch_add1immediate_store_indirect_cx cx time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_ecx ecx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rcx rcx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_dl dl time: 6.0
Bench: ff_fetch_add1immediate_store_indirect_dx dx time: 6.1
Bench: ff_fetch_add1immediate_store_indirect_edx edx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rdx rdx time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_edi edi time: 5.6
Bench: ff_fetch_add1immediate_store_indirect_rdi rdi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_esi esi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_rsi rsi time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_r8 r8 time: 5.5
Bench: ff_fetch_add1immediate_store_indirect_r9 r9 time: 5.5
Bench: ff_add_immediate_indirect_b time: 4.3
Bench: ff_lock_add_immediate_indirect_b time: 8.6
Bench: ff_add_immediate_indirect_w time: 4.4
Bench: ff_lock_add_immediate_indirect_w time: 8.6
Bench: ff_add_immediate_indirect_l time: 4.3
Bench: ff_lock_add_immediate_indirect_l time: 8.6
Bench: ff_add_immediate_indirect_q time: 4.2
Bench: ff_lock_add_immediate_indirect_q time: 8.6
Bench: ff_inc_indirect_b time: 4.5
Bench: ff_lock_inc_indirect_b time: 9.4
Bench: ff_inc_indirect_w time: 4.5
Bench: ff_lock_inc_indirect_w time: 9.4
Bench: ff_inc_indirect_l time: 4.5
Bench: ff_lock_inc_indirect_l time: 9.5
Bench: ff_inc_indirect_q time: 4.5
Bench: ff_lock_inc_indirect_q time: 9.4
  • Zero-latency store-forwarding in some cases for dword and qword operands is a Zen 2 and Zen 4 feature. Also Ice Lake. https://www.agner.org/forum/viewtopic.php?t=41 has some details about when it happens on Zen 2. https://www.realworldtech.com/forum/?threadid=186393&curpostid=186393 has evidence of Ice Lake having similar memory-renaming, but I forget if that got disabled with a microcode update. – Peter Cordes Aug 29 '23 at 09:47
  • Forcing the CPU to work on non-native data sizes is obvious to be slower. That's why we're using 32/64bit variables to store 1 bit of boolean value for the last 25 years. Speed vs size. Also, Ryzen is a 2020 CPU built for high power (65W) while N3540 is a 2014 CPU built for low power (7W). You're running a 2013 Corvette against a 1960's moped. What did you expect? – Agent_L Aug 29 '23 at 09:51
  • @Agent_L I would expect to have the analogous benchmarks from current models of INTEL cpu's that I do not have access to in order to evaluate better the differences – George Kourtis Aug 29 '23 at 09:54
  • @Agent_L: `sizeof(bool)` is `1` in C ABIs for most ISAs including x86-64. Most code isn't limited by store-forwarding latency so this effect isn't usually a big deal, although it does happen especially in 32-bit code (stack args). Load/store of 8 and 16-bit values are fairly efficient on x86 (about the same as 32 and 64-bit, unlike other ISAs), as long as you use `movzx` for zero-extending loads. [Are there any modern CPUs where a cached byte store is actually slower than a word store?](https://stackoverflow.com/q/54217528) (Apparently most non-x86) – Peter Cordes Aug 29 '23 at 10:04
  • 1
    @Agent_L: Also, Intel does still make Silvermont-family CPUs, such as Tremont-based (https://en.wikipedia.org/wiki/Tremont_(microarchitecture)) Jasper Lake made on a 10 nm process, launched in 2021. The E-cores in Alder Lake / Raptor Lake are Gracemont, the successor to Tremont. https://en.wikipedia.org/wiki/Gracemont_(microarchitecture) lists some E-core-only Alder Lake-N series CPUs, and says Sierra Forest is launching in 2024. Low-power Intel CPUs get used in some budget laptops, and some servers like NAS boxes and just high-density servers. – Peter Cordes Aug 29 '23 at 10:05
  • Anyway, yes, the relevant competitor for AMD Zen 2 and Zen 3 is something like Ice or Alder Lake. https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ has a good rundown on it in general, but doesn't mention memory renaming. If your code mostly bottlenecks on latency of non-atomic read-modify-write of memory, you're probably doing something wrong. (Or you're optimizing a very specific problem, such as histogramming; unrolling over multiple arrays of counts can mitigate the problem of long runs of the same number creating latency bottlenecks too long for OoO exec to hide.) – Peter Cordes Aug 29 '23 at 10:15
  • BTW, the "Using Intel" header in this question should be "Using low-power Intel" or your specific CPU microarchitecture, since you should expect a core optimized for lower clock frequencies will have a shorter pipeline and different latencies for some things. https://uops.info/ has numbers for Alder Lake E and P cores, and some earlier low-power as well as big-core Intel, and AMD, so you could just look at those to check your numbers. `add m32, i32` has a lower bound latency of "<= 0" cycles on Alder Lake, and Zen 2 and Zen 4. But 7-cycle latency on Ice Lake; maybe they disable mem renaming – Peter Cordes Aug 29 '23 at 10:23
  • See [What do multiple values or ranges means as the latency for a single instruction?](https://stackoverflow.com/q/60912850) re: its latency estimates being conservative lower bounds, assuming only 1c latency for other unknown steps in the dependency chain. Also [How to interpret uops.info?](https://stackoverflow.com/q/71418230) re: more general stuff. In the detailed info for latencies, we can see they tested `ADD dword ptr [R14], 16777216` for the memory->memory latency (rather than memory->address), and found 1 cycle on Alder Lake E and P cores. So modern low-power Intel has this feature. – Peter Cordes Aug 29 '23 at 10:24
