Why is std::mutex so much worse than std::shared_mutex in Visual C++?

Question

Ran the following in Visual Studio 2022 in release mode:

#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <iostream>

std::mutex mx;
std::shared_mutex smx;

constexpr int N = 100'000'000;

int main()
{
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i != N; i++)
    {
        std::unique_lock<std::mutex> l{ mx };
    }
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i != N; i++)
    {
        std::unique_lock<std::shared_mutex> l{ smx };
    }
    auto t3 = std::chrono::steady_clock::now();

    auto d1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
    auto d2 = std::chrono::duration_cast<std::chrono::duration<double>>(t3 - t2);

    std::cout << "mutex " << d1.count() << "s;  shared_mutex " << d2.count() << "s\n";
    std::cout << "mutex " << sizeof(mx) << " bytes;  shared_mutex " << sizeof(smx) << " bytes \n";
}

The output is as follows:

mutex 2.01147s;  shared_mutex 1.32065s
mutex 80 bytes;  shared_mutex 8 bytes

Why so?

It is surprising that std::shared_mutex, which has more features, is faster than std::mutex, whose features are a strict subset of those.

I wrote measurement code similar to yours on my 1.2 GHz Windows laptop: a simple spin-lock consistently takes `24 ns` per loop iteration, std::mutex `75-85 ns`, std::shared_mutex `42-45 ns`. — Arty, Nov 16 '21 at 14:06
@Arty, being twice as slow as a spinlock is the expected mutex performance. Your spinlock does a `store` with `memory_order_release` on exit, since it only needs to set the lock free. A mutex, however, would do an interlocked fetch operation, most likely an `exchange`, to see whether any waiting threads need to be notified. (x86 has a cheap store with `memory_order_release`, but any `exchange` is not cheap, even `_relaxed`.) — Alex Guteniev, Nov 16 '21 at 14:11
Have you looked at the code? What exactly is your question? Have you compared the features the two support to explain their timing difference? Or is it the size difference, which is not unusual at all if you have a handle/body separation. — Ulrich Eckhardt, Nov 16 '21 at 14:11

Alex Guteniev · Accepted Answer · 2021-11-19T15:51:26.323

TL;DR: an unfortunate combination of backward-compatibility and ABI-compatibility constraints makes std::mutex bad until the next ABI break. OTOH, std::shared_mutex is good.

A decent implementation of std::mutex would first try to acquire the lock with an atomic operation; if the lock is busy, it might spin in a read loop (with a pause instruction on x86), and would ultimately fall back to an OS wait.
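The three stages described above (atomic acquire, bounded spin, OS wait) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MSVC implementation: `sketch_mutex`, `spin_limit` and `contended_count` are made-up names, the spin count is arbitrary, and C++20 `std::atomic::wait`/`notify_one` stands in for the OS wait (with a `yield` fallback where that feature is unavailable).

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative only: not the MSVC, libstdc++ or libc++ implementation.
class sketch_mutex {
    // 0 = unlocked, 1 = locked, 2 = locked and a waiter may be blocked
    std::atomic<int> state_{0};
    static constexpr int spin_limit = 100;  // arbitrary, untuned value

    void block_while_contended() noexcept {
#if defined(__cpp_lib_atomic_wait)
        state_.wait(2, std::memory_order_relaxed);  // C++20: blocks until notified
#else
        std::this_thread::yield();                  // pre-C++20 stand-in for the OS wait
#endif
    }

public:
    bool try_lock() noexcept {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    void lock() noexcept {
        if (try_lock()) return;                  // 1. uncontended fast path
        for (int i = 0; i < spin_limit; ++i) {   // 2. bounded spin: read first, then retry
            if (state_.load(std::memory_order_relaxed) == 0 && try_lock()) return;
            // a production lock would issue a pause instruction here on x86
        }
        // 3. mark the lock as contended; we own it once the previous value was 0
        while (state_.exchange(2, std::memory_order_acquire) != 0)
            block_while_contended();
    }

    void unlock() noexcept {
        // An exchange rather than a plain store: the old value says whether to wake a waiter.
        if (state_.exchange(0, std::memory_order_release) == 2) {
#if defined(__cpp_lib_atomic_wait)
            state_.notify_one();
#endif
        }
    }
};

// Runs `threads` workers that each increment a plain counter under the lock.
inline long long contended_count(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    sketch_mutex mx;
    long long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<sketch_mutex> guard{mx};
                ++counter;
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `contended_count(4, 10000)` should return 40000 if the lock provides mutual exclusion. Note that `unlock` needs an `exchange` to learn whether anyone is waiting, which is the cost difference from a plain spinlock mentioned in the comments on the question.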

There are a couple of ways to implement such a std::mutex:

Directly delegate to the corresponding OS APIs that do all of the above.
Do the spinning and the atomic operations on its own, and call OS APIs only for the OS wait.

Sure, the first way is easier to implement, friendlier to debug, and more robust, so it appears to be the way to go. The candidate APIs are:

CRITICAL_SECTION APIs. A recursive mutex that lacks a static initializer and needs explicit destruction.
SRWLOCK. A non-recursive shared mutex that has a static initializer and doesn't need explicit destruction.
WaitOnAddress. An API to wait for a particular variable to change, similar to the Linux futex.

These primitives have OS version requirements:

CRITICAL_SECTION has existed since, I think, Windows 95, though TryEnterCriticalSection was not present in Windows 9x. The ability to use CRITICAL_SECTION with CONDITION_VARIABLE was added in Windows Vista, together with CONDITION_VARIABLE itself.
SRWLOCK has existed since Windows Vista, but TryAcquireSRWLockExclusive has existed only since Windows 7, so SRWLOCK can directly implement std::mutex only starting with Windows 7.
WaitOnAddress was added in Windows 8.

When std::mutex was added, the Visual Studio C++ library still had to support Windows XP, so std::mutex was implemented by doing things on its own. In fact, std::mutex and the other synchronization primitives were delegated to ConCRT (Concurrency Runtime).

For Visual Studio 2015, the implementation was switched to use the best available mechanism: SRWLOCK starting with Windows 7, and CRITICAL_SECTION starting with Windows Vista. ConCRT turned out not to be the best mechanism, but it was still used for Windows XP and 2003. The polymorphism was implemented by using placement new to construct classes with virtual functions in a buffer provided by std::mutex and the other primitives.
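The placement-new polymorphism described above might look roughly like this. It is a portable sketch with hypothetical names throughout (`buffered_mutex`, `lock_impl_base`, `modern_impl`, `legacy_impl`, `backend`, `lock_once`), not the MSVC STL source: the two implementations simply wrap standard mutexes as stand-ins for the SRWLOCK and pre-Windows 7 paths, the `os_is_modern` flag stands in for runtime OS detection, and the buffer size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>

enum class backend { modern, legacy };

// Common interface; every call through it is a virtual call.
struct lock_impl_base {
    virtual void lock() = 0;
    virtual void unlock() = 0;
    virtual backend kind() const = 0;
    virtual ~lock_impl_base() = default;
};

// Stand-in for an SRWLOCK-based implementation.
struct modern_impl final : lock_impl_base {
    std::mutex m;
    void lock() override { m.lock(); }
    void unlock() override { m.unlock(); }
    backend kind() const override { return backend::modern; }
};

// Stand-in for the implementation used on older systems.
struct legacy_impl final : lock_impl_base {
    std::recursive_mutex m;
    void lock() override { m.lock(); }
    void unlock() override { m.unlock(); }
    backend kind() const override { return backend::legacy; }
};

class buffered_mutex {
    // Raw bytes large enough for the biggest implementation (size is an assumption).
    alignas(std::max_align_t) unsigned char storage_[128];
    lock_impl_base* impl_;  // kept for simplicity; points into storage_

public:
    // Cannot be constexpr: the choice is made at run time and uses placement new.
    explicit buffered_mutex(bool os_is_modern) {
        static_assert(sizeof(modern_impl) <= sizeof(storage_), "buffer too small");
        static_assert(sizeof(legacy_impl) <= sizeof(storage_), "buffer too small");
        void* where = storage_;
        if (os_is_modern)
            impl_ = ::new (where) modern_impl;
        else
            impl_ = ::new (where) legacy_impl;
    }
    ~buffered_mutex() { impl_->~lock_impl_base(); }
    buffered_mutex(const buffered_mutex&) = delete;
    buffered_mutex& operator=(const buffered_mutex&) = delete;

    void lock() { impl_->lock(); }
    void unlock() { impl_->unlock(); }
    backend kind() const { return impl_->kind(); }
};

// Locks and unlocks once, then reports which implementation was selected.
inline backend lock_once(bool os_is_modern) {
    buffered_mutex mx{os_is_modern};
    std::lock_guard<buffered_mutex> guard{mx};
    assert(mx.kind() == (os_is_modern ? backend::modern : backend::legacy));
    return mx.kind();
}
```

The buffer has to be sized for the largest implementation that might ever be placed in it, which is how the object can end up much larger than the implementation actually in use.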

Note that this implementation breaks the requirement for the std::mutex constructor to be constexpr, because of the runtime detection, the placement new, and the inability of the pre-Windows 7 implementation to use only a static initializer.

As time passed, support for Windows XP was finally dropped in VS 2019, and support for Windows Vista was dropped in VS 2022. A change has been made to avoid ConCRT usage, and a change is planned to avoid even the runtime detection of SRWLOCK (disclosure: I've contributed these PRs). Still, due to ABI compatibility across VS 2015 through VS 2022, it is not possible to simplify the std::mutex implementation to avoid all this placing of classes with virtual functions.

What is sadder, although SRWLOCK has a static initializer, the said compatibility prevents having a constexpr mutex: we have to placement new the implementation there. It is not possible to avoid placement new and construct the implementation right inside std::mutex, because std::mutex has to be a standard-layout class (see Why is std::mutex a standard-layout class?).
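The standard-layout constraint can be checked at compile time. The following is a minimal sketch with hypothetical types (`impl_with_vtable`, `holds_impl_by_value`, `holds_raw_storage`); the 64-byte buffer size is arbitrary. It shows that a class holding a polymorphic implementation as a regular member stops being standard-layout, while a class holding only raw bytes does not.

```cpp
#include <mutex>
#include <type_traits>

// Hypothetical polymorphic implementation class.
struct impl_with_vtable {
    virtual void lock() {}
    virtual ~impl_with_vtable() = default;
};

// Hypothetical mutex that constructs the implementation directly as a member.
struct holds_impl_by_value {
    impl_with_vtable impl;
};

// Hypothetical mutex that only reserves raw bytes for a later placement new.
struct holds_raw_storage {
    alignas(void*) unsigned char storage[64];
};

// A class with virtual functions is never standard-layout...
static_assert(!std::is_standard_layout_v<impl_with_vtable>, "");
// ...and neither is a class that has such a member.
static_assert(!std::is_standard_layout_v<holds_impl_by_value>, "");
// A plain byte buffer keeps the enclosing class standard-layout.
static_assert(std::is_standard_layout_v<holds_raw_storage>, "");
// The standard requires this of std::mutex.
static_assert(std::is_standard_layout_v<std::mutex>, "");
```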

So the size overhead comes from the size of the ConCRT mutex.

And the runtime overhead comes from the chain of calls:

a library function call to get to the standard library implementation
a virtual function call to get to the SRWLOCK-based implementation
finally, a Windows API call.

The virtual function call is more expensive than usual because the standard library DLLs are built with /guard:cf.

Part of the runtime overhead comes from std::mutex filling in the ownership count and the locking thread, even though this information is not required for SRWLOCK. This is due to the internal structure shared with recursive_mutex. The extra information may be helpful for debugging, but it does take time to fill in.

std::shared_mutex was designed to support only systems starting with Windows 7, so it uses SRWLOCK directly.

The size of std::shared_mutex is the size of SRWLOCK. SRWLOCK has the same size as a pointer (though internally it is not a pointer).

It still involves some avoidable overhead: it calls the C++ runtime library just to call the Windows API, instead of calling the Windows API directly. This looks fixable with the next ABI, though.

The std::shared_mutex constructor could be constexpr, as SRWLOCK does not need a dynamic initializer, but the standard prohibits implementations from voluntarily adding constexpr to the standard classes.

So swapping a std::mutex for a std::shared_mutex makes sense, and is relatively future-proof, on Windows. — Yakk - Adam Nevraumont, Nov 16 '21 at 14:22
@Yakk-AdamNevraumont, yes. It is likely to become useless in the future, but unlikely to become harmful. However, if you used it with `condition_variable`, it takes `condition_variable_any` to couple with `shared_mutex`, there's no specialized `condition_variable` for `shared_mutex`. — Alex Guteniev, Nov 16 '21 at 14:30
This is a very enlightening Q&A Alex, thanks! I noticed that you printed out the `sizeof` of both mutex types in your question. I'm guessing here, but is the size of the `shared_mutex` 8 because it just contains a pointer to a shared control block? — Ted Lyngmo, Nov 16 '21 at 14:36
@TedLyngmo, I've edited the answer to cover that. There's no shared control block. `SRWLOCK` itself has the same size as a pointer (though internally it is not a pointer). `shared_mutex` just contains `SRWLOCK` by value. — Alex Guteniev, Nov 16 '21 at 14:40
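
On the `condition_variable` caveat raised in the comments above: `std::condition_variable` only works with `std::unique_lock<std::mutex>`, so code that switches to `std::shared_mutex` needs `std::condition_variable_any` instead. A minimal sketch, where `wait_for_value` and the value 42 are purely illustrative:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hands a value from a producer thread to the caller, guarded by a shared_mutex.
inline int wait_for_value() {
    std::shared_mutex smx;
    std::condition_variable_any cv;  // accepts any lock type, unlike std::condition_variable
    int value = 0;
    bool ready = false;

    std::thread producer([&] {
        {
            std::unique_lock<std::shared_mutex> lock{smx};
            value = 42;
            ready = true;
        }
        cv.notify_one();
    });

    int seen = 0;
    {
        std::unique_lock<std::shared_mutex> lock{smx};
        cv.wait(lock, [&] { return ready; });  // releases smx while blocked
        assert(ready);
        seen = value;
    }
    producer.join();
    return seen;
}
```

`std::condition_variable_any` is more general and may carry extra overhead of its own, so the gain from switching mutex types is worth measuring in code that waits on condition variables.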
