
I believe that Windows has been using that instruction internally for a long time now, so it's something CPU manufacturers would have spent effort to optimise?

Of course assuming suitably aligned memory and no sharing of the cache line etc.

  • https://www.agner.org/optimize/instruction_tables.pdf – Hans Passant Mar 15 '20 at 13:21
  • I was looking through Agner's tables, but as these are atomic instructions I wasn't sure whether, in this case, they were the right way to compare performance between the 128/64/32-bit variants – iam Mar 15 '20 at 13:39
  • From the tables it looks like on a modern Intel processor there is about a 25% performance difference between 8B and 16B LOCK variants. Because of the prevalence of 16B SIMD, and 64B cachelines, I'm not totally sure where most of that performance is being lost conceptually – iam Mar 15 '20 at 13:50
  • Agner measures performance for back-to-back uses of the instruction by a single thread, so he's measuring the "hot" / no-contention case. (With an aligned operand). I'd expect contention costs to be about the same regardless of instruction. – Peter Cordes Mar 15 '20 at 18:42
  • Perhaps `lock cmpxchg8b` can combine 2 registers into a single 8-byte operand for the ALU instead of having to do extended-precision 2-register stuff in most of the microcode for `cmpxchg16b`? I'm somewhat surprised there's so much difference, though; Intel CPUs only care about 64-byte boundaries for atomicity of cached loads/stores. Fun fact: [Multi-socket K10 has tearing at 8-byte boundaries](https://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825#7647825) if you don't use `lock`, but Agner didn't measure CX16 on it :( – Peter Cordes Mar 15 '20 at 18:47
  • Note that 64-bit code will generally use `lock cmpxchg qword ptr [mem], reg`, not `cmpxchg8b`. But if you're comparing the cost of DWCAS (the size of 2 pointers) in 32- vs. 64-bit code, then it's `cmpxchg8b` vs. `cmpxchg16b`. – Peter Cordes Mar 15 '20 at 20:44

1 Answer

Out of curiosity, I wrote a small benchmark to compare the cost of 4- and 8-byte `cmpxchg` with `cmpxchg16b`:

#include <cstdint>
#include <benchmark/benchmark.h>

// 16 KiB of zeroed, 16-byte-aligned storage: cmpxchg16b requires a
// 16-byte-aligned operand, and the size is a power of 2 so that the
// index wrap-around below compiles into a cheap `and`.
alignas(16) char input[16 * 1024] = {};

template<class T>
void do_benchmark(benchmark::State& state) {
    unsigned n = 0;
    T* p = reinterpret_cast<T*>(input);
    constexpr unsigned count = sizeof input / sizeof(T);
    unsigned i = 0;
    for(auto _ : state) {
        // CAS of 0 with 0: always succeeds without modifying memory,
        // so only the instruction cost itself is measured.
        T v{0};
        n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
    }
    benchmark::DoNotOptimize(n); // keep n live so the CAS isn't elided
}

BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
BENCHMARK_TEMPLATE(do_benchmark, __int128);
BENCHMARK_MAIN();

I ran it on an Intel Coffee Lake i9-9900KS CPU.

Results with gcc-8.3.0:

$ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/gcc/cmpxchg16b_benchmark
2020-03-15 20:18:48
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.43, 0.40, 0.34
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
do_benchmark<__int128>           6.35 ns         6.35 ns    110215116

Results with clang-8.0.0:

$ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/clang/cmpxchg16b_benchmark
2020-03-15 20:19:00
Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.36, 0.39, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
do_benchmark<__int128>           5.99 ns         5.99 ns    116972653

It looks like `cmpxchg16b` is around 1.6-1.8x as expensive as 8-byte `cmpxchg` on Intel Coffee Lake.


The same benchmark on an AMD Ryzen 9 5950X with gcc-9.3.0:

Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (32 X 4889.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 1.11, 0.52, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
do_benchmark<__int128>           2.22 ns         2.22 ns    316143309

`cmpxchg16b` is around 1.4x as expensive as 8-byte `cmpxchg` on AMD Ryzen 9.

Maxim Egorushkin
  • Interesting; GCC7 and later don't normally inline `lock cmpxchg16b` for atomic 16-byte load/store, but they still do for `__sync_bool_compare_and_swap` (with `-march=` anything and/or `-mcx16` of course). https://godbolt.org/z/f22mE_. Your pointer increment looks like a reasonable minimal amount of ALU work to do between CAS operations; real code would almost certainly have a memory operation between CASes though, which the full barrier would have to drain. (So that extra fixed overhead can make the relative difference between CX16 and qword CAS smaller.) – Peter Cordes Mar 15 '20 at 20:42
  • @PeterCordes Yes, `cmpxchg16b` is only used for the older `__sync` builtins, as the documentation for `-mcx16` states. That option is passed explicitly for good measure, in addition to `-march=native -mtune=native`, and there is no `-latomic` link option. quick-bench fails to link the code with `undefined reference to __sync_bool_compare_and_swap_16`, but there is no way to specify compiler options there. The remainder operator `%` compiles into an `and` instruction, since the array size is chosen to be a large power of 2: https://godbolt.org/z/tqCEYx – Maxim Egorushkin Mar 15 '20 at 21:02