0

I am experimenting with lock-free algorithms. In some cases, I only need an atomic store() operation and no compare_exchange. I have done simple measurements to get a feel for the performance advantage of store over compare_exchange. The surprise was: compare_exchange is faster!

Why is that, or am I making a measurement error?

This is the measurement, all operations are relaxed:

    using ValueType = unsigned long long;
    constexpr auto MemoryOrder = memory_order_relaxed;
    atomic<ValueType> value{};
    cout << "value.is_lock_free(): " << value.is_lock_free() << endl;
    ValueType oldValue{};
    constexpr auto MAX = 100000000;
//===========================================================
    cout << "value.store: " << MAX << endl;
    auto start = chrono::high_resolution_clock::now();

    for(ValueType newValue{1ULL}; newValue<MAX; ++newValue){
        value.store(newValue, MemoryOrder);
        oldValue = newValue;
    }
    auto end = chrono::high_resolution_clock::now();
    chrono::duration<double> diffStore = end - start;
    cout << "diffStore: " << diffStore.count() << " s" << endl;
//===========================================================
    cout << "value.compare_exchange_weak: " << MAX << endl;
    value.store(0, MemoryOrder);
    oldValue = 0ULL;
    start = chrono::high_resolution_clock::now();

    for(ValueType newValue{1ULL}; newValue<MAX; ++newValue){
        value.compare_exchange_weak(oldValue, newValue, MemoryOrder, MemoryOrder);
        oldValue = newValue;
    }
    end = chrono::high_resolution_clock::now();
    chrono::duration<double> diffCompare = end - start;
    cout << std::setw(9);
    cout << "diffCompare: " << diffCompare.count() << " s" << endl;

    cout << "Store/Compare: " << diffStore / diffCompare << endl;
}

Output:

value.is_lock_free(): 1
value.store: 100.000.000 
diffStore: 5.77747 s //Duration
value.compare_exchange_weak: 100.000.000
diffCompare: 4.44158 s //Duration
Store/Compare: 1.30077
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
gerdi
  • 165
  • 6
  • Store should be way faster *if* you use `mo_release` or `mo_relaxed`, or if you're on AArch64. Otherwise, a seq_cst store may cost about the same as a CAS (basically the same on x86-64). I assume you would have mentioned if you were testing on a platform other than x86-64. – Peter Cordes Apr 08 '22 at 13:03
  • Comparing these things is hard and requires proper profiling tools to understand why something might be faster. But the thing that provides stronger guarantee should in general be slower and if it is not the case then a deep analysis is required. – ixSci Apr 08 '22 at 13:06
  • but I use mo_relaxed, no fences only the atomic variable is stored – gerdi Apr 08 '22 at 13:06
  • Did you compile with optimization enabled? If not benchmarking is useless. In this case, `std::atomic::store()` doesn't inline, so the memory order is a runtime-variable and GCC just picks seq_cst instead of branching to see if it can do something weaker. What compiler, what version, what options, on what CPU model? (And what libc version, although libatomic goes with GCC version.) Also, this isn't a [mcve] so I couldn't copy/paste and test on my own machine if I wanted to. – Peter Cordes Apr 08 '22 at 13:07
  • The repeat count is high enough that CPU frequency warmup should be only a small part of the time for the one you time first. But see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Apr 08 '22 at 13:10
  • On my Skylake i7-6700k, with `g++ -O3` 11.1, I get Store: 0.0267804 s, Compare: 0.462911 s, for a ratio of **Store/Compare: 0.0578522**. As expected. With optimization disabled, the ratio worsens to 0.842338 (with CAS taking almost 1 second, about 4.5x faster than yours). If you disabled optimization on an older GCC that uses `mov`/`mfence` for seq_cst store (recent GCC switched to `xchg`), that would explain it. (I had to add missing headers and the top of a `main(){`, and a nasty `using namespace std`. https://godbolt.org/z/hh9P96EG9) – Peter Cordes Apr 08 '22 at 13:45
  • See my answer on [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379) re: Skylake `mfence` being slower than a `lock`ed instruction. – Peter Cordes Apr 08 '22 at 13:52

0 Answers0