Why did CRITICAL_SECTION performance become worse on Win8

Question

It seems like CRITICAL_SECTION performance became worse on Windows 8 and higher. (see graphs below)

The test is pretty simple: several concurrent threads each take the lock 3 million times to access a variable exclusively. You can find the C++ program at the bottom of the question. I ran the test on Windows Vista, Windows 7, Windows 8, and Windows 10 (x64, VMWare, Intel Core i7-2600 3.40GHz).

The results are shown in the image below. The X-axis is the number of concurrent threads. The Y-axis is the elapsed time in seconds (lower is better).

What we can see:

SRWLock performance is approximately the same on all platforms
CriticalSection performance became worse relative to SRWL on Windows 8 and higher

The question is: Can anybody please explain why CRITICAL_SECTION performance became worse on Win8 and higher?

Some notes:

The results on real machines are pretty much the same: CS is much worse than std::mutex, std::recursive_mutex and SRWL on Win8 and higher. However, I have no chance to run the test on different OSes with the same CPU.
The std::mutex implementation for Windows Vista is based on CRITICAL_SECTION, but for Win7 and higher std::mutex is based on SRWL. This holds for both MSVS17 and 15 (to verify, search for the primitives.h file in the MSVC++ installation and look for the stl_critical_section_vista and stl_critical_section_win7 classes). This explains the difference in std::mutex performance between Win Vista and the others.
As noted in the comments, std::mutex is a wrapper, so a possible explanation for its overhead relative to SRWL is the overhead introduced by the wrapper code.

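#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // std::rand, std::srand
#include <ctime>     // time
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <typeinfo>  // typeid
#include <vector>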

#include <Windows.h>

const size_t T = 10;
const size_t N = 3000000;
volatile uint64_t var = 0;

const std::string sep = ";";

namespace WinApi
{
    class CriticalSection
    {
        CRITICAL_SECTION cs;
    public:
        CriticalSection() { InitializeCriticalSection(&cs); }
        ~CriticalSection() { DeleteCriticalSection(&cs); }
        void lock() { EnterCriticalSection(&cs); }
        void unlock() { LeaveCriticalSection(&cs); }
    };

    class SRWLock
    {
        SRWLOCK srw;
    public:
        SRWLock() { InitializeSRWLock(&srw); }
        void lock() { AcquireSRWLockExclusive(&srw); }
        void unlock() { ReleaseSRWLockExclusive(&srw); }
    };
}

template <class M>
void doLock(void *param)
{
    M &m = *static_cast<M*>(param);
    for (size_t n = 0; n < N; ++n)
    {
        m.lock();
        var += std::rand();
        m.unlock();
    }
}

template <class M>
void runTest(size_t threadCount)
{
    M m;
    std::vector<std::thread> thrs(threadCount);

    const auto start = std::chrono::system_clock::now();

    for (auto &t : thrs) t = std::thread(doLock<M>, &m);
    for (auto &t : thrs) t.join();

    const auto end = std::chrono::system_clock::now();

    const std::chrono::duration<double> diff = end - start;
    std::cout << diff.count() << sep;
}

template <class ...Args>
void runTests(size_t threadMax)
{
    {
        int dummy[] = { (std::cout << typeid(Args).name() << sep, 0)... };
        (void)dummy;
    }
    std::cout << std::endl;

    for (size_t n = 1; n <= threadMax; ++n)
    {
        {
            int dummy[] = { (runTest<Args>(n), 0)... };
            (void)dummy;
        }
        std::cout << std::endl;
    }
}

int main()
{
    std::srand(time(NULL));
    runTests<std::mutex, WinApi::CriticalSection, WinApi::SRWLock>(T);
    return 0;
}

The test project was built as a Windows Console Application in Microsoft Visual Studio 17 (15.8.2) with the following settings:

Use of MFC: Use MFC in a Static Library
Windows SDK Version: 10.0.17134.0
Platform Toolset: Visual Studio 2017 (v141)
Optimization: O2, Oi, Oy-, GL

There are some differences in semantics between SRWLock and Critical Section, have a read of: https://stackoverflow.com/questions/3498798/replace-critical-section-with-srw-lock — Richard Critten, Sep 04 '18 at 16:52
I had a quick look at the std::mutex implementation in my environment (Win7, VS2015) -- there is one layer of indirection on top of whatever OS primitive std::mutex chooses (see _Mtx_storage + _Mtx_init_in_situ/etc functions used to operate the primitive). This may explain some of the observed performance reduction. — C.M., Sep 04 '18 at 17:22
Your use of `std::rand` makes me worried about thread safety. — Yakk - Adam Nevraumont, Sep 04 '18 at 17:32
_"...It is implementation-defined whether rand() is thread-safe...."_: https://en.cppreference.com/w/cpp/numeric/random/rand It maintains an internal state, what happens if 2 threads try and simultaneously mutate this state. — Richard Critten, Sep 04 '18 at 17:36
@Rom098 It is shared state between threads with semantics undefined by the C++ standard. In code attempting to profile multi-threaded performance. — Yakk - Adam Nevraumont, Sep 04 '18 at 17:37
Do you understand that finding why all this happens will require quite some time? I mean the graphs are cool and all but I doubt anyone can answer such a question offhand, it requires thorough investigation. — ixSci, Sep 04 '18 at 17:44
You might want to also profile the non-lock `struct NoLock{void lock() {} void unlock() {}};` -- see how much of the cost is from `rand()` and how much from locking. — Yakk - Adam Nevraumont, Sep 04 '18 at 17:51
The internal implementation of SRW locks and CS changed from one Windows version to another. The CS is more complex and contains visibly more code and checks than SRW. On the other hand, the code inside your critical section is too small. Try, say, `SwitchToThread()` or `CreateFileW` + `CloseHandle`, i.e. more time/work inside the critical region, and compare the difference in that case. SRW will still be faster, but not by as much. The *mutex* is a shell over SRW, so it will always be a bit slower than it, but Vista may have another implementation, which could explain this. — RbMm, Sep 04 '18 at 17:56
@Yakk-AdamNevraumont In general the standard says that rand is not thread safe. Do you mean the rand() call under the locked mutex is not thread-safe enough? Anyway I did experiments with “var += 1” and others, and the results are the same. — Rom098, Sep 04 '18 at 18:36
Regarding the discussion surrounding srand/rand, visual c++ uses a thread-local random seed so having two threads executing rand() at the same time will not interfere with each other. But also note this means that each thread needs to call srand to initialize the RNG. — SoronelHaetir, Sep 04 '18 at 18:44
@SoronelHaetir thanks, but the goal of the test is not to increase the variable randomly; the goal is time measurement. So if any misuse of srand occurs here, it doesn't affect the test. — Rom098, Sep 04 '18 at 18:54
@SoronelHaetir That sounds better than I feared; I was worried about possible contention. — Yakk - Adam Nevraumont, Sep 04 '18 at 19:23
For reliable results you should probably run the tests on real devices. Maybe you are measuring certain aspects of VMWare performance more than anything else. @ixs: That's understood. Some questions are harder to answer or take more time than others. — IInspectable, Sep 04 '18 at 21:03
@RbMm I tried inserting a `std::this_thread::yield()` call under the locked mutex. On Win10 the results for `std::mutex` and `SRWL` are pretty much the same, but `CS` is still 10-25% worse than `std::mutex`, depending on the number of threads. — Rom098, Sep 05 '18 at 09:28
@Rom098 - this is expected because CS is more complex. `std::this_thread::yield()` is also too small a job. Try, for example, `if (HANDLE hEvent = CreateEvent(0,0,0,0)) { CloseHandle(hEvent); }` — RbMm, Sep 05 '18 at 09:32
I updated the question with some explanations, so the questions 1 and 2 seem answered now. The question 3 is still waiting for an answer. — Rom098, Sep 05 '18 at 11:52
@IInspectable The results on real machines are pretty much the same: CS is much worse than std::mutex, std::recursive_mutex and SRWL on Win8 and higher. However, I have no chance to run the test on different OSes with the same CPU. I'd publish the results here, but then someone may start talking about CPU differences, etc. So I guess publishing results from real machines doesn't make sense. — Rom098, Sep 05 '18 at 12:59
Critical sections are optimized for low contention scenarios. Yours is the exact opposite: Guaranteed continuous contention. — Raymond Chen, Sep 05 '18 at 17:16
@RaymondChen Yes, but this doesn’t explain why CS became worse on Win8. — Rom098, Sep 05 '18 at 18:21
That I cannot explain. Just pointing out that you are using critical sections in a way they were not optimized for. — Raymond Chen, Sep 05 '18 at 19:46
Though `std::mutex` is a wrapper on SRWL, it may perform worse due to its ability to fall back to an implementation that does not use SRWL. Calls to the implementation are made through pointers-to-function, and the runtime library is compiled with security options enabled, so _Control Flow Guard_ chimes in. — Alex Guteniev, Mar 02 '19 at 17:47
@AlexanderGutenev The question was about CriticalSection, not std::mutex. — Rom098, Mar 05 '19 at 14:08
I see, I was just explaining an observation from "Some notes" in the question. — Alex Guteniev, Mar 05 '19 at 14:16
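The no-op lock baseline suggested in the comments above could be sketched roughly as follows. `NoLock` is taken from the comment; `timeLoop` is a hypothetical helper that is not part of the question's program. It runs the loop body on the calling thread only, because running `NoLock` from several threads at once would be a data race on the shared variable. It also uses `std::chrono::steady_clock` rather than the question's `system_clock`, which is an assumption made here for interval timing.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>

// No-op "lock" from the comment: same interface as the other lock types,
// but lock() and unlock() do nothing.
struct NoLock
{
    void lock() {}
    void unlock() {}
};

// Hypothetical helper (not in the original program): times `iterations`
// rounds of lock / rand() / unlock on the calling thread and returns the
// elapsed time in seconds. The accumulated value is written to `sink` so
// the loop body cannot be optimized away.
template <class M>
double timeLoop(std::size_t iterations, std::uint64_t &sink)
{
    M m;
    volatile std::uint64_t acc = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < iterations; ++n)
    {
        m.lock();
        acc = acc + static_cast<std::uint64_t>(std::rand());
        m.unlock();
    }
    const auto end = std::chrono::steady_clock::now();

    sink = acc;
    return std::chrono::duration<double>(end - start).count();
}
```

Comparing `timeLoop<NoLock>(N, sink)` with `timeLoop<std::mutex>(N, sink)` gives a rough idea of how much of the uncontended per-iteration cost comes from `rand()` and the loop rather than from the lock. It says nothing about behavior under contention.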

score 4 · Answer 1 · answered Mar 01 '19 at 17:18

See "Windows Critical Section - how to disable spinning completely". Starting from Windows 8, Microsoft changed the default behavior of Critical Section without a word in the documentation: if you use InitializeCriticalSection(&cs), you get spinning with an undocumented dynamic spin adjustment algorithm enabled. See my comment here: https://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/#comment-57420

For your test, try using InitializeCriticalSectionAndSpinCount(&cs,1) instead of InitializeCriticalSection(&cs). This should make it behave somewhat similarly to Windows 7, though there are plenty of other changes in that area.
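A minimal sketch of that change, shaped like the question's `WinApi::CriticalSection` wrapper. The class name `SpinCountCriticalSection` is made up for illustration, and the non-Windows branch is only an assumption added so the sketch compiles on other toolchains; it plays no part in the measurement.

```cpp
#include <cassert>
#include <mutex>

#ifdef _WIN32
#include <Windows.h>

// Like the question's CriticalSection wrapper, but initialized with an
// explicit spin count instead of the default InitializeCriticalSection().
class SpinCountCriticalSection
{
    CRITICAL_SECTION cs;
public:
    explicit SpinCountCriticalSection(DWORD spinCount = 1)
    {
        const BOOL ok = InitializeCriticalSectionAndSpinCount(&cs, spinCount);
        assert(ok);
        (void)ok; // silence "unused variable" when NDEBUG is defined
    }
    ~SpinCountCriticalSection() { DeleteCriticalSection(&cs); }

    SpinCountCriticalSection(const SpinCountCriticalSection &) = delete;
    SpinCountCriticalSection &operator=(const SpinCountCriticalSection &) = delete;

    void lock() { EnterCriticalSection(&cs); }
    void unlock() { LeaveCriticalSection(&cs); }
};
#else
// Portability stand-in only (assumption): no critical sections here, so the
// class wraps std::mutex and ignores the spin count.
class SpinCountCriticalSection
{
    std::mutex m;
public:
    explicit SpinCountCriticalSection(unsigned long /*spinCount*/ = 1) {}
    void lock() { m.lock(); }
    void unlock() { m.unlock(); }
};
#endif
```

Because the constructor has a default argument, the class can be default-constructed by the question's `runTest`, so it could be added to the comparison as `runTests<std::mutex, WinApi::CriticalSection, SpinCountCriticalSection, WinApi::SRWLock>(T);`.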

Alexander Safronov

What are *other changes in that area* you refer to? I know there were a lot of changes throughout history, like adding keyed events, or changing from fair algorithm to unfair, but I don't know any other changes between Windows 7 and Windows 10, except this automatic spin. – Alex Guteniev Mar 02 '19 at 07:23
Actually it's somewhere between `std::mutex` and `CriticalSection` in case I use `InitializeCriticalSectionAndSpinCount(&cs,1)`, but it's still much closer to `CriticalSection`. So your explanation doesn't look like the root cause. – Rom098 Mar 06 '19 at 10:09
@Rom098 any luck solving the mystery? Maybe mark this as the answer. It sheds pretty much light on the case, even if not explaining that in 100%. – quetzalcoatl Jun 26 '20 at 20:11
@quetzalcoatl As I mentioned before, this answer doesn't seem to be the root cause. The most likely reason is that there was some problem in Windows updates. As far as I know (but I'm not sure) the problem can't be reproduced now with all the latest updates installed. – Rom098 Jul 03 '20 at 13:28
@Rom098 It's actually quite a likely root cause of the benchmark results observed. There is a real shared cache line in this example, so any algorithm which *doesn't* spin and instead re-schedules with a delay avoids several future cache misses, as the contention is temporarily gone. And the cost of re-scheduling has dropped a lot compared to Vista, so burning memory transactions on spinning is no longer worth it at all. – Ext3h Sep 28 '21 at 12:04
