14

Suppose I have an application that may or may not have spawned multiple threads. Is it worth it to protect operations that need synchronization conditionally with a std::mutex as shown below, or is the lock so cheap that it does not matter when single-threading?

#include <atomic>
#include <mutex>

std::atomic<bool> more_than_one_thread_active{false};

void operation_requiring_synchronization() {
    //...
}
void call_operation_requiring_synchronization() {
    if (more_than_one_thread_active) {
        static std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
        operation_requiring_synchronization();
    } else {
        operation_requiring_synchronization();
    }
}

Edit

Thanks to all who have answered and commented, very interesting discussion.

A couple of clarifications:

The application processes chunks of input, and for each chunk decides whether it will be processed in a single-threaded, parallel, or otherwise concurrent fashion. It is quite possible that no multi-threading will be needed at all.

The operation_requiring_synchronization() will typically consist of a few inserts into global standard containers.

Profiling is, of course, difficult when the application is platform-independent and should perform well under a variety of platforms and compilers (past, present and future).

Based on the discussion so far, I tend to think that the optimization is worth it.

I also think the std::atomic<bool> more_than_one_thread_active should probably be changed to a non-atomic bool multithreading_has_been_initialized. The original idea was to be able to turn the flag off again when all threads other than the main one are dormant, but I see how this could be error-prone.

Abstracting the explicit conditional away into a customized lock_guard is a good idea (and facilitates future changes of the design, including simply reverting back to std::lock_guard if the optimization is not deemed worth it).

FaulerHase
  • An uncontested mutex is nearly free. The cost of the `if` is probably comparable. – David Schwartz Aug 31 '15 at 18:16
  • And if you're considering an alternative to the trivial always-latch-mutex approach with code such as this, you'd better make damn sure that *during* `operation_requiring_synchronization()` another thread cannot possibly *start up* from scratch and enter `call_operation_requiring_synchronization()`, or it will (a) find there is more than one thread running, assuming that was set somewhere else, and (b) happily glom on to a mutex that no one else owns, thereby allowing concurrent access to what should be mutually exclusive. – WhozCraig Aug 31 '15 at 18:20
  • You might want to look at http://stackoverflow.com/q/11011953/2963099 – Glenn Teitelbaum Aug 31 '15 at 18:24
  • A nested mutex lock will block, even if done in one thread. This kind of behavior difference could cause bugs to hide and only appear in multi-thread mode... – Yakk - Adam Nevraumont Aug 31 '15 at 18:29
  • @DavidSchwartz, why are you saying so? An uncontested mutex is a kernel call, memory fence and optimization barrier. I am not eager to say it is free. – SergeyA Aug 31 '15 at 18:40
  • @SergeyA That might have been true eight years ago or so, but it's not true any more. At least, not on typical platforms. (The optimization barrier is a fair point.) – David Schwartz Aug 31 '15 at 18:41
  • @DavidSchwartz, define 'typical'. Also I'd like to know what you are basing your assumption on. Are you saying mutexes are no longer optimization barriers, or are you implying this is no longer a kernel call? – SergeyA Aug 31 '15 at 18:42
  • @SergeyA By typical, I mean modern desktop computers. I'm not assuming anything, I'm exercising professional judgment based on years of profiling and optimization experience. And yes, acquiring an uncontended mutex is definitely no longer a kernel call. – David Schwartz Aug 31 '15 at 18:45
  • @DavidSchwartz, I'd still like to understand what your 'professional judgement' is based on. I also find it really strange to call 'desktop computers' a 'typical platform' for C++ development. At any rate, I doubt we will be able to find common ground, since we are clearly coming from very different environments. – SergeyA Aug 31 '15 at 18:47
  • @DavidSchwartz: Why do you suppose [`HeapAlloc`](https://msdn.microsoft.com/en-us/library/windows/desktop/aa366597%28v=vs.85%29.aspx) in Windows has a `HEAP_NO_SERIALIZE` flag? – user541686 Aug 31 '15 at 19:00
  • @Mehrdad Because Windows heaps aren't just wrapped with a single mutex lock/unlock call. They have sophisticated optimizations for multi-threaded use that incur costs when they're used from a single thread. It has nothing to do with the issue being discussed here. – David Schwartz Aug 31 '15 at 19:04
  • Why don't you write a benchmark and *measure* this? – Jeff Hammond Aug 31 '15 at 19:13
  • @SergeyA No, an uncontested mutex is NOT a kernel call, at least on Linux. It is done using futex, and "a properly programmed futex-based lock will not use system calls except when the lock is contended". – Severin Pappadeux Aug 31 '15 at 19:16
  • I think that the amount and type of work being done in `operation_requiring_synchronization` is germane to your question. For example if `operation_requiring_synchronization` is already doing any sort of I/O then the mutex cost should be trivial next to that. – Mark B Aug 31 '15 at 19:17
  • @SeverinPappadeux, on Linux - sure. Linux is not the only OS used, though. – SergeyA Aug 31 '15 at 19:20
  • @DavidSchwartz: [I call BS.](http://ideone.com/aeDfqd) – user541686 Aug 31 '15 at 19:20
  • Synchronization is always costly, even if it's an atomic locked instruction (e.g. a spinlock). Today's mutexes are a combination of spinlock and kernel lock: the mutex works as a spinlock for a specified number of spins and turns into a kernel lock only after that many unsuccessful spins - this way the lock tries to adapt to the specific locking pattern of the app. With a lot of lock contention it often behaves as a kernel lock; with less lock contention it is usually only a spinlock without a kernel call (but still has a cost that is hardware dependent). – pasztorpisti Sep 01 '15 at 00:35
  • There is a contradiction that bugs me: the more threads you have, the more you degrade the performance with a central lock. I'm surprised that you want to optimize the case where you have only 1 thread. You should rather worry about the multi-threaded case. Do you have a reason to optimize for 1 thread? Actually the kernel executed by threads should be lock-free whenever possible. That's the base for a good multi-threaded design. Only the job distributor and/or result collector code should use a minimal amount of locking. The single-threaded case would execute the lock-free kernel. – pasztorpisti Sep 01 '15 at 01:00
  • @Mehrdad Your trivial program isn't proof of anything. The compiler could perform many optimisations. – curiousguy Sep 01 '15 at 02:14

8 Answers

14

Generally, optimizations should not be performed in the absence of demonstrated need in your specific use case if they affect the design or organization of code. That's because these kinds of algorithmic optimizations can be very difficult to perform later. Point micro-optimizations can always be added later and should be avoided prior to need for several reasons:

  1. If you guess wrong about the typical use case, they can actually make performance worse.

  2. They can make code harder to debug and maintain.

  3. Even if you guess right about the use case, they can make performance worse on new platforms. For example, mutex acquisition has gotten more than an order of magnitude cheaper in the last eight years. Tradeoffs that make sense today might not make sense tomorrow.

  4. You can wind up wasting time on things that are unnecessary, and, worse, you can waste time that needed to go into other optimizations. Without enormous amounts of experience, it's very difficult to predict where the actual bottlenecks in your code will be, and even experts are frequently surprised when they actually profile.

This is a classic point micro-optimization, so it should be done only if profiling demonstrates some likely benefit.

David Schwartz
  • Well said, even if it is similar to my answer. There is a big difference between optimal design and optimal implementation detail – Glenn Teitelbaum Sep 01 '15 at 14:59
  • Very important corollary to this generally good rule: optimizations that can be done should be commented as such up front, and tests put in place to prevent developers from rendering them un-optimizable through incorrectly located optionality. (For a good example, see how @Mehrdad below had to disable optimizations to prove that mutexes are slow, which they kind of are.) I've seen too many projects architected without these sorts of point optimizations in mind... such that future needed optimizations become massive wastes of time and money. – Erik Aronesty Nov 01 '19 at 19:56
12

Yes, it is worth it.

Underneath your question, David Schwartz commented:

An uncontested mutex is nearly free. The cost of the if is probably comparable.

This is blatantly wrong (but a common misconception).
Try running this:

#include <stdio.h>   // for printf
#include <time.h>

#include <atomic>
#include <mutex>

static std::atomic<bool> single_threaded(true);

int main(int argc, char *argv[])
{
    (void)argv;
    if (argc == 100001) { single_threaded = !single_threaded; /* to prevent compiler optimization later */ }
    int n = argc == 100000 ? -1 : 10000000;
    {
        std::mutex mutex;
        clock_t const begin = clock();
        unsigned int total = 0;
        for (int i = 0; i < n; ++i)
        {
            if (single_threaded)
            {
                total = ((total << 1) ^ i) + ((total >> 1) & i);
            }
            else
            {
                std::lock_guard<std::mutex> lock(mutex);
                total = ((total << 1) ^ i) + ((total >> 1) & i);
            }
        }
        clock_t const end = clock();
        printf("Conditional: %u ms, total = %u\n", (unsigned int)((end - begin) * 1000U / CLOCKS_PER_SEC), total);
    }
    {
        std::mutex mutex;
        clock_t const begin = clock();
        unsigned int total = 0;
        for (int i = 0; i < n; ++i)
        {
            std::lock_guard<std::mutex> lock(mutex);
            total = ((total << 1) ^ i) + ((total >> 1) & i);
        }
        clock_t const end = clock();
        printf("Unconditional: %u ms, total = %u\n", (unsigned int)((end - begin) * 1000U / CLOCKS_PER_SEC), total);
    }
}

My output? (Visual C++)

Conditional: 24 ms, total = 3684292139
Unconditional: 845 ms, total = 3684292139

user541686
  • I ran this with g++ 5.0.0 with -O3 and got 0 for both, which ruins the test a little bit. Without optimizations I got 90ms vs. 350ms, but a test that worked with optimizations would have been more valuable. – SirGuy Aug 31 '15 at 19:35
  • @GuyGreer: try reading the constants from stdin or args and printing the totals. – user541686 Aug 31 '15 at 19:36
  • I added reading `n` in from stdin and got times of 520 vs 0. Once I added printing `total` at the end of each test the times changed to 620 vs 280. So on my system this optimization is pretty bad and you are better off without the conditional statement. My CPU is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz as per `cat /proc/cpuinfo` on ubuntu12.04. – SirGuy Aug 31 '15 at 19:55
  • As Walter Bright once said, "measuring gives you a leg up on experts who are too good to measure" and this is no different. Know your target architecture and measure on that. +1 for actually measuring, though the OP is the one who ultimately needs to measure and find out for himself. – SirGuy Aug 31 '15 at 19:56
  • Oh yeah, and I should mention that when I ran this with g++ 4.8 instead the times were 520 vs 510. – SirGuy Aug 31 '15 at 19:58
  • This is totally unrealistic test code carefully designed to exaggerate the impact as much as possible, and even so, it shows minimal impact (less than 50ns per). Worse, the answer is just totally misleading because it suggests that one can measure the value of a hardware-specific and use-case-specific optimization from artificial test code run on one platform. – David Schwartz Aug 31 '15 at 19:58
  • I was able to reproduce your results on Solaris x86, while on Linux I could only replicate your results with optimization completely turned off. With optimization on the results were quite close, g++ 4.4.6 on both platforms. – Mark B Aug 31 '15 at 21:02
  • I changed `single_threaded` initialization from `true` to `false` and got: `Conditional: 689 ms | Unconditional: 627 ms`, so if you mostly run multithreaded, there is a cost – Glenn Teitelbaum Aug 31 '15 at 21:07
  • @Mehrdad I have no problem admitting I'm wrong when I'm wrong. But here, you and SergeyA are giving awful advice. – David Schwartz Aug 31 '15 at 21:28
  • @GuyGreer: That's a good find, and it's probably because Linux's implementation of a mutex is different from Windows's implementation. You should post a separate answer and mention that. (Btw, I updated the test to print the total and do some more arithmetic to inhibit compiler optimizations that wreck the benchmark.) – user541686 Aug 31 '15 at 23:33
  • @Mehrdad You don't prove that premature micro-optimization is not needed by making the premature optimization and seeing what happens. That precisely inverts the rational order. – David Schwartz Aug 31 '15 at 23:44
  • @DavidSchwartz: Actually, what *really* inverts the rational order is giving a wrong answer and then changing the question to make it right! – user541686 Aug 31 '15 at 23:46
  • @Mehrdad The question was whether this particular micro-optimization was worth it or whether it was likely to be so inexpensive that the optimization is pointless. This answer completely fails to address the question asked because it doesn't compare the cost to anything, leaving no clue whether it's an optimization that's likely to provide any benefit. A non-zero performance impact on one platform doesn't justify a micro-optimization, especially one that has non-zero costs on other platforms. – David Schwartz Aug 31 '15 at 23:55
  • @Mehrdad I posted an answer over an hour before you asked me, and it's currently the highest-voted answer. – David Schwartz Sep 01 '15 at 01:09
  • @DavidSchwartz, yeah, this proves everything. You know what - myriads of house flies cannot be wrong, and their diet should be adopted indeed! – SergeyA Sep 01 '15 at 17:59
  • @MarkB: gcc's libstdc++ already sort of does this optimization, checking `__gthread_active_p ()` inside mutex take/release, doing nothing if false. On GNU/Linux (glibc) [it works by checking if you linked with `-pthread` or not](https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/gthr-posix.h#L214). Single-stepping the asm in GDB on my GCC9.1 Arch Linux desktop shows that it makes no actual function calls, just branching on a register value. But I can't repro it on Godbolt, with g++ or clang with libc++ https://godbolt.org/z/TENrfD – Peter Cordes Dec 11 '19 at 07:06
  • @DavidSchwartz: fun fact: glibc mutexes on Linux already sort of do this optimization for you internally, short-circuiting mutex lock/unlock if you didn't build with `-pthread`. (See prev. comment). It doesn't go as far as checking if multiple threads have actually been *started*, though; that could get hairy if you start a thread while you hold a lock. But for manually doing this optimization, if your loop doesn't start threads you should read `single_threaded` into a local var instead of re-reading it every time. (Especially with seq_cst, if you care about non-x86 where that takes barriers) – Peter Cordes Dec 11 '19 at 07:10
  • Results on a 4.2GHz i7-6700k Skylake, running Arch GNU/Linux, g++9.1 -O3: 7ms conditional, same for unconditional if built without `-pthread`. Or with `-pthread`, unconditional becomes ~133ms. So about 20x slower; consistent with the one per 18 + 18 cycle reciprocal throughputs of `lock cmpxchg` (lock) and `lock dec` (unlock), plus some overhead for other stores that those memory-barrier atomic RMWs have to wait for. (Unlock is "fancy" probably because of optionally using futex to wake waiters. But at least it's not a system call; that'd be a hundred or thousand times slower with Spectre) – Peter Cordes Dec 11 '19 at 07:24
9

Uncontended locks are not too bad on modern systems, not needing to enter the kernel. But they still involve a full memory barrier (on some ISAs including x86 but maybe not AArch64), and an atomic RMW operation. They're slower than a perfectly-predicted compare/branch.

And being a function call, they defeat some optimizations, e.g. forcing the compiler to spill variables from registers back to memory, including the pointer members of a std::vector control block, introducing extra store/reload latency. (And actually the full memory barrier would defeat store-forwarding).

(Being non-inlinable is how mutex functions actually prevent compile-time reordering on most implementations. So the library function can just do the special stuff in asm to atomically take the lock and prevent runtime reordering. This part involves draining the store buffer on x86, or at least an acquire operation as part of an atomic RMW in general. Taking a lock needs something like was_locked = atomicvar.exchange(1, std::memory_order_acquire) or a CAS, releasing a lock needs a release store. Most libraries will have this part written in asm; I'm using std::atomic to describe what the asm does. x86 atomic-RMW operations are all full barriers, as strong as seq_cst, but AArch64 can do it more weakly. Still, it does impose some memory ordering which might limit out-of-order exec and the store buffer. Especially if you take a lock again right away after releasing, even AArch64 would end up with effectively a memory barrier.)
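
To make that description concrete, here is a minimal sketch of the atomic core described above (illustrative only, not a real mutex implementation; real ones add spin/backoff heuristics and futex waiting):

#include <atomic>

// Toy spinlock showing the memory-ordering story described above.
struct toy_lock {
    std::atomic<int> locked{0};
    void lock() {
        // acquire RMW: critical-section accesses can't be reordered before this
        while (locked.exchange(1, std::memory_order_acquire) == 1) {
            // spin; a real implementation would back off or futex-wait here
        }
    }
    void unlock() {
        locked.store(0, std::memory_order_release);  // release store publishes our writes
    }
};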


Depending on how much work you do and how fine-grained your locking is, the cost of an uncontended mutex can be pretty small. But if you're taking it around every vector::push_back() in a loop, you might see a speedup factor on the order of about 20 for that loop.

(Based on assumptions of one store per 2 or 3 clock cycles on average, which is reasonable assuming some memory-level parallelism and/or cache hits. A push_back loop could even be auto-vectorized and average better than 1 element per clock cycle, assuming small elements and cheap computation of values. lock cmpxchg on Skylake has 1 per 18 cycle throughput with no other memory operations in between; https://agner.org/optimize/. Other microarchitectures, including for non-x86 ISAs, will be different, but about an order of magnitude is probably a good ballpark estimate.)

It might still be a negligible part of your total program run-time, though, and it will slightly hurt the multi-thread case by doing extra loads of another global var that has to stay hot in cache for good performance. And that global var might be in a different cache line from anything else.


If you had a bad thread/mutex library where even the uncontended case entered the kernel, you could be looking at a factor of maybe 400 speedup, or tens of thousands on a modern x86 kernel that uses microcode-assisted Spectre mitigation by flushing the branch-predictors; that takes thousands of cycles every time you enter the kernel. I'd hope there aren't any systems with a kernel modern enough to do that but still using heavy-weight locks.

I think the mainstream OSes (Linux / Mac / Windows) all have lightweight locking that only enters the kernel as a fallback on contention. See Jeff Preshing's Always Use a Lightweight Mutex article. Probably also Solaris and *BSD.

(Cost to enter the kernel at all with syscall on Skylake x86: ~100 to 150 cycles or so, IIRC. With Spectre/Meltdown mitigations on x86, you change page tables on entry and exit (expensive and potentially leading to TLB misses / page walks) and maybe use a special asm instruction to flush branch prediction.)

A system call is also essentially serializing; in a tight user-space loop, it doesn't leave much for out-of-order exec to look at. And there's at least some work within the kernel. (It also destroys any memory-level parallelism you could have had across loop iterations, but a full barrier from a mutex lock already does that.)

So if for some reason you care about bad implementations with very expensive locks even in the uncontended case, you very likely want this. (And probably want the multi-threaded case to be less fine-grained). But such implementations are hopefully not widespread. GNU/Linux is definitely not like this, and AFAIK nothing important is either.


gcc's libstdc++ already sort of does this optimization, checking __gthread_active_p () inside mutex lock/unlock (e.g. __gthread_mutex_lock in /usr/include/c++/9.1.0/x86_64-pc-linux-gnu/bits/gthr-default.h), doing nothing if false. And this is in a header so that wrapper around pthread_mutex_lock can inline into your code.

On GNU/Linux (glibc) it works by checking if you built with g++ -pthread or not. (Checking if the (dynamic) linker gave us a non-zero address for a libpthread private function symbol name, using weak alias stuff. Since this condition is a link-time constant, it doesn't even need to be atomic<> so the compiler can keep the result in a register. It's basically just a load of a non-atomic void*.) libstdc++ on other OSes (not glibc) has other strategies for checking, see the other definitions.

Mehrdad's test-case runs fast even for the Unconditional case, when built without -pthread. ~727ms for the 1000M iterations on Arch GNU/Linux, g++9.1 -O3, glibc 2.29-4, i7-6700k (Skylake) at ~4.2GHz (turbo) with echo performance > energy_performance_preference. That's almost exactly 3 clock cycles per iteration, bottlenecked on the 3-cycle loop-carried dependency chain through total (footnote 1). (I bumped up the iteration count from Mehrdad's original instead of using higher-precision timing / printing, partly to hide startup overhead and max-turbo ramp-up.)

But with g++ -O3 -pthread so glibc's pthread_mutex_lock and unlock do get called, it's about 18 times slower on Skylake. About 13000ms on my machine, which is about 54 clock cycles / iteration.

The test-case doesn't do any memory access inside the critical section, just
total = ((total << 1) ^ i) + ((total >> 1) & i) on a local unsigned int total which the compiler can keep in a register across the mutex function calls. So the only stores that the lock cmpxchg (lock) and lock dec (unlock) have to drain from the store buffer are the plain stores to other mutex fields, and the return address pushed on the stack by x86's call instruction. This should be somewhat similar to a loop doing .push_back(i) on a std::vector. Per Agner Fog's testing, those locked instructions alone with no other memory access would account for 36 cycles of throughput cost. The actual 54 cycles/iter shows that other work in the lock/unlock functions, and waiting for other stores to flush, has a cost. (Out-of-order exec can overlap the actual total = ... calculation with all this; we know that locked instructions don't block out-of-order exec of independent ALU instructions on Skylake. Although mfence does because of a microcode update to fix an erratum, making gcc's mov+mfence strategy for seq-cst stores instead of xchg like other compilers even worse.)


Footnote 1: At -O3, GCC hoists the if(__gthread_active_p ()) out of the loop, making two versions of the loop. (This is measurably faster than having 3 taken branches inside the loop, including the loop branch itself.)

The "Conditional" version includes a useless load of single_threaded into a register that gets overwritten right away, because nothing happens based on the test. (Compilers don't optimize atomics at all, like volatile, so even an unused load stays. But fortunately x86-64 doesn't need any extra barrier instructions for seq_cst loads so it barely costs anything. Still, over 10 back-to-back runs: Conditional: 728ms pretty consistently. Unconditional: 727ms pretty consistently. vs. a calculated 716ms for 3 cycles/iter at a measured average of 4.19GHz user-space cycles/sec under perf stat -r10 ./a.out.

But at -O2, the branches on __gthread_active_p stay inside the loop:

  • Conditional: 730 to 750 ms (less stable from run to run than before) with 2 branches per iteration.
  • Unconditional (no pthread): ~995 ms with 3 taken branches per iteration. Branch mis rate is still 0.00% but they do have a cost for the front-end.
  • Unconditional (with pthread): ~13100 ms (up from 13000 for -O3 unconditional)

If you compile with gcc -O2, or even at -O3 if the compiler decides not to do loop-multiversioning or inversion or whatever it's called when an if is hoisted, you'll get asm like this:

# g++ 9.1 -O2 for x86-64 on Arch GNU/Linux

    # early in the function, before any loops: load a symbol address into a register
    10de:       48 8b 2d f3 2e 00 00    mov    rbp,QWORD PTR [rip+0x2ef3]        # 3fd8 <__pthread_key_create@GLIBC_2.2.5>
     ...
# "Unconditional" inner loop
    11b8:       48 85 ed                test   rbp,rbp           # do{
    11bb:       74 10                   je     11cd <main+0x13d>  # if( __gthread_active_p () )
      11bd:       4c 89 ef                mov    rdi,r13   # pass a pointer to the mutex in RDI
      11c0:       e8 bb fe ff ff          call   1080 <pthread_mutex_lock@plt>
      11c5:       85 c0                   test   eax,eax
      11c7:       0f 85 f1 00 00 00       jne    12be <main+0x22e>  # if non-zero retval: jump to a call std::__throw_system_error( eax ) block
    11cd:       43 8d 04 24             lea    eax,[r12+r12*1]    # total<<1 = total+total
    11d1:       41 d1 ec                shr    r12d,1             # shifts in parallel
    11d4:       31 d8                   xor    eax,ebx
    11d6:       41 21 dc                and    r12d,ebx           # xor, and with i
    11d9:       41 01 c4                add    r12d,eax           # add the results: 3 cycle latency from r12 -> r12 assuming perfect scheduling
    11dc:       48 85 ed                test   rbp,rbp
    11df:       74 08                   je     11e9 <main+0x159>  # conditional skip mov/call
      11e1:       4c 89 ef                mov    rdi,r13
      11e4:       e8 77 fe ff ff          call   1060 <pthread_mutex_unlock@plt>
    11e9:       83 c3 01                add    ebx,0x1
    11ec:       81 fb 80 96 98 00       cmp    ebx,0x989680
    11f2:       75 c4                   jne    11b8 <main+0x128>  # }while(i<10000000)

I can't repro this code-gen on Godbolt with g++, or clang with libc++. https://godbolt.org/z/kWQ9Rn Godbolt's install of libstdc++ maybe doesn't have the same macro defs as a proper install?

call __gthrw_pthread_mutex_lock(pthread_mutex_t*) isn't inlining so we can't see the effect of the if (!__gthread_active_p ()) check.


Make your check efficient if you do this

If you're the only thread running, that won't change unless your loop starts threads.

You can make the variable non-atomic. Set it right before you start any threads, then never write it again. All threads can then just read it into a register across loop iterations. And compilers can even hoist the check out of loops for you. (Like gcc -O3 does for the branch inside the GCC mutex implementation as described above, but not at -O2).

You can manually hoist it out of a loop instead of letting compilers branch on a loop-invariant register value after hoisting the load of a non-atomic variable. If manually hoisting helps your compiler make a loop significantly faster, might as well go all-in on this optimization:

// global scope
bool multi_threaded = false;   // zero init lets this go in the BSS

// in a function
if (!multi_threaded) {
    // optionally take a lock here, outside the inner loop:
    // std::lock_guard<std::mutex> lock(mutex);
    for (int i = 0; i < n; ++i) {
       stuff;
    }
} else {
    for (int i = 0; i < n; ++i) {
       std::lock_guard<std::mutex> lock(mutex);
       stuff;
    }
}

Pull the loop body out into a function to avoid duplication if it's more than trivial.
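
One way that factoring might look (a sketch; run_maybe_locked and its shape are illustrative, not from the original answer), reusing the multi_threaded flag defined above:

#include <mutex>

// Illustrative helper: one copy of the loop body, locking strategy chosen per call.
// Usage: run_maybe_locked(n, mutex, [&](int i) { stuff(i); });
template <class Body>
void run_maybe_locked(int n, std::mutex& m, Body body) {
    if (!multi_threaded) {                        // the flag from the snippet above
        for (int i = 0; i < n; ++i)
            body(i);                              // single-threaded: no locking at all
    } else {
        for (int i = 0; i < n; ++i) {
            std::lock_guard<std::mutex> lock(m);  // fine-grained lock per iteration
            body(i);
        }
    }
}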

// starting threads
multi_threaded = true;
std::thread t(stuff);

If you ever want to return to single-threaded mode, you can do that safely at some point when you know you're the only thread:

t.join();
multi_threaded = false;    // all threads that could be reading this are now done
                           // so again it can be safely non-atomic

You could even have multi_threaded variables for different data structures, to track whether there were multiple threads that might possibly look at a certain data structure. At that point you could think about making them atomic. Then you'd want bool nolocks = some_container.skip_locking.load(std::memory_order_relaxed); and use the same local for the whole loop.

I haven't thought this through carefully, but I think that works as long as no other thread will set some_container.skip_locking and start another thread that accesses it; that wouldn't be safe anyway because this thread might be in the middle of modifying a data structure without holding a lock.

You could even treat the flag like "coarse locking" instead of "no locking" so it still works if another thread wants to start using a data structure; the time from starting a new thread to when it can actually acquire a lock for this data structure might be significant if we hold the lock across a huge number of iterations.

 if (!some_container.fine_locking.load(std::memory_order_relaxed)) {
     // take a lock here, outside an inner loop
     std::lock_guard<std::mutex> lock(mutex);
     for (int i = 0; i < n; ++i) {
         some_container.push_back(i);
     }
 } else {
     // lock *inside* the loop.
     for (int i = 0; i < n; ++i) {
         std::lock_guard<std::mutex> lock(mutex);
         some_container.push_back(i);
     }
 }

This could easily get pretty hairy; this is just brainstorming what's possible, not what's a good idea!

Peter Cordes
  • Very informative answer! It's well worth a web article. :-) As it suggests that the uncontended mutex aren't worth worrying, it partly answers my question [How to minimize the mutex locking for an object when only 1 thread mostly uses that object and the other thread(s) use it rarely?](https://stackoverflow.com/q/59526616/514235). You may consider posting an answer there. – iammilind Dec 30 '19 at 15:09
  • @iammilind: *As it suggests that the uncontended mutex aren't worth worrying* That's the opposite of what I'm saying. I show a benchmark where it leads to a ~18x slowdown when used around `.push_back` on a std::vector, with current GCC + glibc on a Skylake CPU. If your critical sections are tiny enough, then yes it is worth worrying about. And yeah, I started writing an answer to your question. If I get back to it I'll collect up my comments there into an answer. – Peter Cordes Dec 30 '19 at 15:55
4

I disagree with the widespread idea that locking a mutex is cheap. If you really are after performance, you wouldn't want to do this.

Mutexes (even uncontested) hit you with three hammers: they penalize compiler optimizations (mutexes are optimization barriers), they incur memory fences (on un-pessimized platforms) and they are kernel calls. So if you are after nanosecond-level performance in tight loops, it is something worth considering.

Branching is not great, either - for multiple reasons. The real solution is to avoid operations requiring synchronization in a multithreaded environment. As simple as that.
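
For instance (an illustrative sketch of that advice, not code from this answer): give each thread its own container and merge the results single-threaded at the end, so the hot loop needs no synchronization at all.

#include <thread>
#include <vector>

// Each worker fills a thread-private vector; no mutex in the hot loop.
std::vector<int> process_all(int n_threads, int per_thread) {
    std::vector<std::vector<int>> partial(n_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&partial, t, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                partial[t].push_back(t * per_thread + i);  // thread-private, lock-free
        });
    for (auto& w : workers) w.join();
    std::vector<int> merged;  // merge happens after join, still single-threaded
    for (auto& p : partial)
        merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}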

SergeyA
  • Yeah, it's definitely not as cheap as people love to think. +1 – user541686 Aug 31 '15 at 18:56
  • @Mehrdad, I assume (I actually have proof in the form of comments above) that people are basing their assumptions on a usage example of desktop PCs running a word processor. Yes, in this case mutex costs are 100% negligible. But why bother with C++ in this scenario to begin with? Managed languages would be much more suitable. – SergeyA Aug 31 '15 at 18:59
  • Yeah I've seen the comments above. Not entirely sure I understand what it's based on, but I've definitely gained performance by removing mutex operations from my single-threaded code before... – user541686 Aug 31 '15 at 19:02
  • @SergeyA On what modern platforms are acquires and releases of uncontested mutexes kernel calls? – David Schwartz Aug 31 '15 at 19:05
  • @DavidSchwartz, Solaris, for instance. – SergeyA Aug 31 '15 at 19:19
  • @SergeyA Solaris implemented uncontended mutex acquisition within a process without a kernel call (using atomic test-and-set) in 1998 -- 17 years ago. – David Schwartz Aug 31 '15 at 19:51
  • @DavidSchwartz I was under the impression that Solaris didn't have it. Nevertheless, it is a minor point in my observation, and what's even more important, I assume such a platform can be found. Moreover, memory fencing and the optimization barrier are still there. I suggest (especially given the performance measurement done above) you admit you were wrong to give the author the base advice. – SergeyA Aug 31 '15 at 20:59
  • @SergeyA Someone asks a very generic question and we should base our answers on what an implementation that might happen to exist somewhere might happen to be like? Avoid a commonly-used, standardized class because someone somewhere might have implemented it badly?! This is not a complicated question -- it's basically, "Should I implement a small micro-optimization without demonstrated need", and the answer is simple too -- "no". – David Schwartz Aug 31 '15 at 21:30
  • @DavidSchwartz RMWs are more expensive than conditionals in general though. Strict memory ordering is another pessimization as well. There is another question of whether the cost is *negligible* though. – Jason Aug 31 '15 at 21:50
  • @Jason True. You get an RMW and you do have a barrier that memory optimizations cannot pass. However, over the last few years, massive effort has been put into minimizing the cost of these operations as multi-core systems have gone from rare to the norm. They used to require context switches, lock physical buses, and empty pipelines. They do not do these things anymore. – David Schwartz Aug 31 '15 at 22:24
  • @DavidSchwartz They do generally still require noisy MESI messages (RFO) though, which is somewhat a fundamental limitation. Java actually explicitly does escape analysis to elide locks. It's doubtful the optimization would benefit AoT-compiled code like C/C++ though (unless maybe the loader were capable of the optimization). – Jason Aug 31 '15 at 22:59
  • @Jason An uncontended mutex requires no MESI messages. The lock can remain resident exclusive in the cache of the core that's running the code. It's no different from a regular read/write to memory. – David Schwartz Aug 31 '15 at 23:01
  • @DavidSchwartz The cache line *could* be exclusive. – Jason Aug 31 '15 at 23:03
  • @DavidSchwartz It's the entire cache line though, not just the integer. – Jason Aug 31 '15 at 23:05
  • @DavidSchwartz Not necessarily. If anything on the cache line has been modified, or anything on the cache line is shared, the RFO is *very* expensive. An already exclusive cache line is vastly less expensive though. – Jason Aug 31 '15 at 23:14
  • @Jason This is precisely the same for a normal read or write to memory. – David Schwartz Aug 31 '15 at 23:16
  • @DavidSchwartz Not exactly, an RFO on a modified line isn't always the same as a normal write. It could actually require a flush to memory. – Jason Aug 31 '15 at 23:21
  • @Jason We're back in history again. It hasn't been like that on Intel CPUs since the Pentium Pro. – David Schwartz Aug 31 '15 at 23:23
  • @DavidSchwartz It's news to me, but not everything is necessarily Intel. – Jason Aug 31 '15 at 23:33
  • @Jason Right. This is kind of a platform-specific question. – David Schwartz Aug 31 '15 at 23:38
2

You are on the right track - write the functional part without synchronization and add it externally, if and when needed.

Instead of the explicit if-block I would still instantiate the lock object, and hide the complexity in there.

template <class Mutex>
struct faster_lock {
  explicit faster_lock(Mutex& m) : m_(m), locked_(more_than_one_thread_active) {
    if (locked_) m_.lock();    // lock only when the question's flag says we're threaded
  }
  ~faster_lock() noexcept { if (locked_) m_.unlock(); }
  Mutex& m_;
  const bool locked_;          // remember the decision made at construction
};

{
  faster_lock<std::mutex> lock(mutex);
  operation_requiring_synchronization();
}

And the last note - if you have the atomic flag anyway, you can just turn it into a spinlock and keep your logic simpler.

bobah
  • Hiding the complexity is definitely the way to go. You could take this a step further using a policy to define the mutex type which could no-op on lock/unlock or the lock guard which could no-op on the constructor/destructor (assuming RAII is in effect). – Andrew Sep 01 '15 at 03:22
  • Rolling your own spinlock is usually a terrible idea, and would defeat the purpose of not doing any atomic RMW operations in the single-thread case. An uncontended mutex is about the same thing on a good thread library, like GCC with libstdc++ / glibc. (Although something that can inline might help.) – Peter Cordes Dec 11 '19 at 07:31
  • @PeterCordes - do you have benchmark results? (Not that the spinlock was the key point of the answer anyway.) – bobah Dec 30 '19 at 14:20
  • Yes, I did single-step into the asm of glibc's `pthread_mutex_lock` and unlock to see that it doesn't do too much beyond a `lock cmpxchg`, at least in the uncontended case when that succeeds. I also did some testing with Mehrdad's microbenchmark [in my answer](https://stackoverflow.com/questions/32317370/avoid-cost-of-stdmutex-when-not-multi-threading/59286955#59286955) – Peter Cordes Dec 30 '19 at 14:26
1

Yes, often avoiding an unnecessary lock with a conditional will improve performance, simply because a mutex will normally rely on an RMW or entering the kernel, both of which are relatively expensive compared to a simple branch. See the double-checked locking idiom for an example of another scenario where avoiding locks can be beneficial.
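
For reference, the C++11 form of that idiom looks roughly like this (a sketch with made-up names, not code from the question):

#include <atomic>
#include <mutex>

struct Widget { /* expensive-to-construct shared resource (hypothetical) */ };

std::atomic<Widget*> g_instance{nullptr};
std::mutex g_init_mutex;

Widget* get_instance() {
    Widget* p = g_instance.load(std::memory_order_acquire);  // fast path: no lock
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_init_mutex);
        p = g_instance.load(std::memory_order_relaxed);       // re-check under the lock
        if (p == nullptr) {
            p = new Widget;
            g_instance.store(p, std::memory_order_release);   // publish the finished object
        }
    }
    return p;
}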

However, you always want to consider the cost to benefit. Multi-threaded bugs can creep in when you start special casing for single and multi-threaded code, which can suck to track down. The other thing to consider is that while there may be a measurable difference between eliding the lock and not, it might not be a measurable impact on the software as a whole. So measure, but measure intelligently.

Jason
0

In general, it is possible that it is cheap enough not to worry about until you are done.

When you are done, you can profile it both ways and see the impact.

Keep in mind you will have to profile the effect for both single- and multi-threaded use. It might affect the multi-threaded case as well.

#ifdef USE_CONDITIONAL_GUARDED_MUTEX
std::atomic<bool> more_than_one_thread_active{false};
#else
static const bool more_than_one_thread_active{true}; // always use mutex
#endif

You might want to consider making this a compile-time option, and have single- and multi-threaded versions of your binary; that way no if is needed:

#ifdef SINGLE_THREADED_WITHOUT_MUTEX
static const bool more_than_one_thread_active{false}; // never use mutex
#else
static const bool more_than_one_thread_active{true}; // always use mutex
#endif

Almost every optimizer will remove code guarded by a const bool based on its value.
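
For example, here is a sketch (using the question's function names) of what the single-threaded build effectively compiles down to:

// With more_than_one_thread_active being a compile-time false constant, the
// optimizer deletes the flag test and the whole locking branch, leaving the
// equivalent of:
void call_operation_requiring_synchronization() {
    operation_requiring_synchronization();  // no flag test, no mutex code emitted
}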

Glenn Teitelbaum
  • There are two problems with the suggestion 'code first, profile later'. The first is that later is not defined and sometimes never arrives; the second is that when it arrives, it shows that the whole design might need to be redone to achieve acceptable performance. – SergeyA Aug 31 '15 at 18:56
  • @SergeyA that is the other side of premature optimization – Glenn Teitelbaum Aug 31 '15 at 18:59
  • @SergeyA That would be a sensible thing to say if we were talking about algorithmic optimizations and optimizations that affect the design of the code. But here, we're talking about a micro-optimization that has nothing to do with the structure or organization of code. – David Schwartz Aug 31 '15 at 21:29
0

As with all things in software the answer is, unhelpfully, "it depends."

A little more helpfully: it differs across hardware platforms of course, but generally, the cost of locking is likely to be significantly higher (proportionally) than that of a conditional block. However, a reasonable argument can be made that if you're locking somewhere where this cost matters, you may have locked in the wrong place. This depends a lot on the use case. And while there are use cases where a lock needs to be used in a critical path, it's not clear from your question whether that is the case.

Here are some examples to illustrate what I mean:

  • If operation_requiring_synchronization() is a costly function operating on a large buffer, such as convolving a 4k image, the cost of a lock vs if statement is probably going to be lost in the noise of the several million computations taking place on your inner loop. Probably wiser to opt for simple/readable code, and focus on optimizing your convolution code.
  • If operation_requiring_synchronization() is relatively quick, such as inserting something into a collection shared across threads (e.g. network messages on a closed system?), and call_operation_requiring_synchronization() is called many times in a loop, it may indeed make sense to use a conditional to avoid the extra cost, so you can improve throughput.

This is where David Schwartz is 100% right -- you really ought to contemplate your use case, and measure whether the code you're considering "micro-optimizing" will actually impact runtime in a meaningful way.