
I am considering options for implementing a queue for a project, one requirement of which is that the producer, at least, must be as low-latency as possible. To this end, I have been investigating "lock free" queues that use std::atomic to control access to the data structure by the producer and consumer threads. My hope was that this would avoid the overhead of std::mutex, and specifically std::unique_lock, which the code currently uses.
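
For context, here is a minimal sketch of the kind of structure I have in mind: a hypothetical single-producer/single-consumer ring buffer (the names and fixed sizing are illustrative, not the actual project code):

#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscQueue
{
   T buffer[N];
   std::atomic<size_t> head{0};   // next slot to read; written only by the consumer
   std::atomic<size_t> tail{0};   // next slot to write; written only by the producer

public:
   bool push(const T& item)       // producer thread only
   {
      const size_t t = tail.load(std::memory_order_relaxed);
      const size_t next = (t + 1) % N;
      if(next == head.load(std::memory_order_acquire))
         return false;            // queue is full
      buffer[t] = item;
      tail.store(next, std::memory_order_release);
      return true;
   }

   bool pop(T& item)              // consumer thread only
   {
      const size_t h = head.load(std::memory_order_relaxed);
      if(h == tail.load(std::memory_order_acquire))
         return false;            // queue is empty
      item = buffer[h];
      head.store((h + 1) % N, std::memory_order_release);
      return true;
   }
};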

I have therefore written a simple test program to assess the relative performance of std::mutex (coupled with std::unique_lock) and std::atomic. The program also checks that the atomic object is lock free, which it is.

#include <mutex>
#include <atomic>
#include <thread>
#include <chrono>
#include <iostream>

#define CYCLES 100000000

void testAtomic()
{
   bool var(true);
   std::atomic_bool _value(true);

   std::cout << "atomic bool is ";

   if(!_value.is_lock_free())
      std::cout << "not ";
   std::cout << "lock free" << std::endl;
   const auto _start_time = std::chrono::high_resolution_clock::now();

   for(size_t counter = 0; counter < CYCLES; counter++)
   {
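      // separate atomic load and store (not an atomic RMW); both are seq_cst by default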
      var = _value.load();
      var = !var;
      _value.store(var);
   }

   const auto _end_time = std::chrono::high_resolution_clock::now();

   std::cout << 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>
      (_end_time - _start_time).count() << " s" << std::endl;
}

void testMutex()
{
   bool var(true);
   std::mutex _mutex;

   const auto _start_time = std::chrono::high_resolution_clock::now();

   for(size_t counter = 0; counter < CYCLES; counter++)
   {
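      // take the lock, toggle a plain bool, release via unique_lock's destructor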
      std::unique_lock<std::mutex> lock(_mutex);
      var = !var;
   }

   const auto _end_time = std::chrono::high_resolution_clock::now();

   std::cout << 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>
      (_end_time - _start_time).count() << " s" << std::endl;
}

int main()
{
   std::thread t1(testAtomic);
   t1.join();
   std::thread t2(testMutex);
   t2.join();
   return 0;
}

When running this program, I get the following output:

atomic bool is lock free
3.49434 s 
2.31755 s 

This suggests that std::mutex (with std::unique_lock) is significantly faster, which is the opposite of what I had come to expect from reading up on atomics vs. mutexes. Are my findings correct? Is there a problem with my test program? Is my understanding of the differences between the two incorrect?

The code was compiled with GCC 4.8.5 on CentOS 7.

Pete
    You should include the compile/link line too. `clang++ -O3 foo.cpp` or whatever. – Eljay Jun 02 '20 at 17:04
  • I used `g++ -std=c++11 testatomic.cpp -lpthread ` – Pete Jun 02 '20 at 17:08
  • @Pete Add `-O3` and use `-pthread` not `-lpthread` – Ted Lyngmo Jun 02 '20 at 17:09
  • @TedLyngmo This speeds things up; however, the relative performance numbers are still 'backwards'. With -O3 and -pthread: ```atomic bool is lock free 2.16815 s 1.11816 s ``` – Pete Jun 02 '20 at 17:10
  • Have you tried using the current version of GCC (10.1)? – Eljay Jun 02 '20 at 17:12
  • @Pete what Ted was getting at is that profiling unoptimized code is generally a waste of time. Crom only knows what it will be like after the compiler is done with it. If both speed up in roughly the same proportion, OK, but they don't have to. Sometimes the performance changes radically because the compiler is able to do something really sneaky. – user4581301 Jun 02 '20 at 17:12
  • You could get a different result if you had two threads per test competing for the resource. – Ted Lyngmo Jun 02 '20 at 17:12
  • My benchmark doesn't show the same: http://quick-bench.com/tXFPTKSl5A0AZz1-yvhFjubKLLI – Ted Lyngmo Jun 02 '20 at 17:18
  • I'm not able to reproduce your numbers even with optimization off; the atomic version is still faster. Upgrade your compiler (but also please profile an optimized build). – rustyx Jun 02 '20 at 17:19
  • @TedLyngmo It's my understanding that, if implemented properly, atomics should always be faster than mutexes, as they avoid a lot of the overheads inherent in stopping/blocking threads, etc. Are you saying that without another thread competing for access to the shared resource, this may not be the case? – Pete Jun 02 '20 at 17:19
  • @Pete I'm speculating, but perhaps the compiler notices that it can optimize the mutex away but not the atomic. It's far fetched, but compilers are tricky :-) – Ted Lyngmo Jun 02 '20 at 17:21
  • `var` in the test with `mutex` is likely optimized out. The `atomic` test seems unlikely to optimize out the boolean operations. It may account for the difference. – François Andrieux Jun 02 '20 at 17:21
  • @FrançoisAndrieux Indeed. In my quick-bench test the atomic version was almost twice as fast. – Ted Lyngmo Jun 02 '20 at 17:22
  • A mutex in an uncontended state is essentially an atomic. So there would be little difference in the absence of competing threads. The other thing is how you're going to handle the contended state using an atomic; *that* is where many lock-free solutions get bitten hard and end up worse off than the original mutex-based one. – rustyx Jun 02 '20 at 17:26
  • Why would you ever have a function-local mutex? – curiousguy Jun 02 '20 at 19:15
  • @rustyx why should a lock-free algorithm perform worse than a lock-based one under contention? In almost all cases a lock-free algorithm will provide higher throughput. – mpoeter Jun 02 '20 at 20:45
  • @Pete by default atomic operations are sequentially consistent, but a seq cst store is rather expensive. Try `_value.store(var, std::memory_order_relaxed)`; that should improve performance. – mpoeter Jun 02 '20 at 20:48
  • I assume it's obvious that the `std::atomic` load-change-store version is not functionally equivalent to the `std::mutex` one? Should you be testing against a `fetch_xor`? – Chris Hall Jun 03 '20 at 13:00

2 Answers


What hardware are you testing on?

Since you're using GCC, the std::atomic seq_cst store will compile to a mov plus a slow mfence, instead of the somewhat less slow xchg-with-mem (which is also a full barrier, like all other x86 atomic RMW operations).
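
For a rough picture of the difference, here is a sketch; the instruction notes in the comments are what GCC typically emits for x86-64, but exact codegen varies by compiler and version:

#include <atomic>

std::atomic<bool> flag{false};

void store_seq_cst(bool v)
{
   flag.store(v);                               // GCC: mov + mfence (full barrier)
}

void store_release(bool v)
{
   flag.store(v, std::memory_order_release);    // plain mov: x86 stores are already release
}

bool exchange_seq_cst(bool v)
{
   return flag.exchange(v);                     // xchg with memory: atomic RMW, also a full barrier
}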

Taking a mutex costs an atomic RMW (like xchg, not mov + mfence). And if you're lucky, releasing the mutex can be just a plain store (like mo_release). There is zero contention here, so acquiring the lock always succeeds.
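
To illustrate, an uncontended lock/unlock pair can be about this cheap. The following is a hypothetical minimal spinlock, not how a real pthread mutex is implemented (a real one blocks in the kernel under contention):

#include <atomic>

class SpinLock
{
   std::atomic<bool> locked{false};

public:
   void lock()
   {
      // one atomic RMW (x86: xchg); with zero contention it succeeds on the
      // first try, so the loop body never runs
      while(locked.exchange(true, std::memory_order_acquire)) {}
   }

   void unlock()
   {
      // just a plain store with release semantics; no barrier needed on x86
      locked.store(false, std::memory_order_release);
   }
};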

It's certainly plausible that the code behind those mutex lock / unlock library functions is less expensive than mfence, especially on Skylake CPUs with updated microcode where mfence is a full barrier for out-of-order execution as well as memory. (See the bottom of this answer, and also Does lock xchg have the same behavior as mfence?)


Also, note that your mutex loop keeps the local bool var in a register and isn't actually updating it in memory inside the loop. (Your code on the Godbolt compiler explorer with gcc4.8.5.)

# the main loop from testMutex
.L80:                                                       # do {
        mov     rdi, rsp                                      # pointer to _mutex on the stack
        call    __gthrw_pthread_mutex_lock(pthread_mutex_t*)
        test    eax, eax
        jne     .L91                                          # mutex error handling
        mov     rdi, rsp                                      # pointer to _mutex again
        call    __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
        sub     rbx, 1
        jne     .L80                                        # }while(--counter)

xor bl, 1 inside the loop would be irrelevant; out-of-order exec could overlap that with other work.

If a reference to var escaped the function so the compiler had to have it in sync in memory before non-inline function calls (including to pthread library functions), we'd expect something like xor byte ptr [rsp+8], 1. That would also be pretty cheap and perhaps mostly hidden by out-of-order exec, although the load/ALU/store could be something that a full barrier would have to wait for when draining the store buffer.


Speeding up your std::atomic code:

You're intentionally avoiding an atomic RMW, it seems, instead loading into a tmp var and doing a separate store. If you use just release instead of seq_cst, the store compiles to a plain store instruction on x86 (or to cheaper barriers on most other ISAs).

      bool tmp = _value.load(std::memory_order_relaxed);   // or acquire
      _value.store(!tmp, std::memory_order_release);

This should run at about 6 cycles per inversion: just the latency of one ALU operation plus the store-forwarding latency of the store/reload, versus maybe 33 cycles per iteration for the best-case throughput of mfence (https://uops.info/).

Or, since this is a non-atomic modification anyway, just store alternating values without re-reading the old value. You can usually only avoid an atomic RMW in cases where a single producer writes the value and other threads only read it. So let the producer keep the value it's modifying in a register (a non-atomic local var), and store copies of it.

   bool var = true;
   for(size_t counter = 0; counter < CYCLES; counter++)
   {
      var = !var;
      _value.store(var, std::memory_order_release);
   }

Also, don't use leading underscores for your own variable names; such names are reserved for the implementation. (A single underscore followed by a lowercase letter is only reserved at file/global scope, but it's still bad practice.)

Peter Cordes

I was told that on some implementations, a mutex will initially spin-lock in user space before falling back to a system call to block.

That could be why it seems faster here: with no contention, the lock is always acquired on the cheap spinning path, so no system call is ever made.

If a system call were actually made, I doubt you would see the same results.

(I can't verify this, but I thought it could be a reason.)
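
If that's the case, the effect might look roughly like this hypothetical spin-then-yield lock. This is a sketch only; a real adaptive mutex would block on a futex/kernel wait rather than yield:

#include <atomic>
#include <thread>

class HybridLock
{
   std::atomic<bool> locked{false};

public:
   void lock()
   {
      // fast path: retry a bounded number of times in user space
      for(int spins = 0; spins < 1000; spins++)
         if(!locked.exchange(true, std::memory_order_acquire))
            return;                     // acquired without any system call
      // slow path: keep trying, but yield the CPU between attempts
      while(locked.exchange(true, std::memory_order_acquire))
         std::this_thread::yield();
   }

   void unlock()
   {
      locked.store(false, std::memory_order_release);
   }
};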

Moshe Rabaev