
I've written this implementation of a double buffer:

// ping_pong_buffer.hpp

#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:

    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    ping_pong_buffer(std::size_t size)
        : _read_buffer{ size }
        , _read_valid{ false }
        , _write_buffer{ size }
        , _write_valid{ false } {}

    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:

    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;

};

Using this dummy test, which performs nothing but swaps, its performance is about 20 times worse on Linux than on Windows:

#include <thread>
#include <iostream>
#include <chrono>

#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {

    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();

    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }

    const auto t_end = std::chrono::steady_clock::now();

    producer.join();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';

    return 0;

}

Environments of the tests are:

  • Linux (Debian Stretch): Intel Xeon E5-2650 v4, GCC: 900 to 1000 ms
    • GCC flags: -O3 -pthread
  • Windows (10): Intel i7 10700K, VS2019: 45 to 55 ms
    • VS2019 flags: /O2

You can find the code here on godbolt, with ASM output for both GCC and VS2019 using the compiler flags actually used.

This huge gap has also been observed on other machines and seems to be due to the OS.

What could be the reason for this surprising difference?

UPDATE:

The test has also been performed on Linux on the same 10700K, and it is still a factor of 8 slower than on Windows.

  • Linux (Ubuntu 18.04.5): Intel i7 10700K, GCC: 290 to 300 ms
    • GCC flags: -O3 -pthread

If the number of iterations is increased by a factor of 10, I get 2900 ms.

phuclv
Giovanni Cerretani
  • How do you make the same binary executable that runs on both OS? Could it be that two binaries are built with two compilers? – Dialecticus Mar 30 '21 at 10:44
  • Can you include the compilation options, please. – Richard Critten Mar 30 '21 at 10:47
  • I question the validity of the comparison when it's done on such different hardware. – Dan M. Mar 30 '21 at 10:48
  • @Dialecticus OP does mention GCC and VS2019 so yes two different compilers are used. – Niteya Shah Mar 30 '21 at 10:50
  • @DanM. Especially 2012-era Sandy Bridge at 2-2.8 GHz vs 2020-era Comet Lake at 3.8-5.1 GHz. That's just a completely different ball game. – user1937198 Mar 30 '21 at 10:52
  • The Xeon is a 2.2 GHz processor with boosts up to 2.9; the i7 is a 3.2 GHz with boosts up to 5.1. They aren't really comparable. – ChrisMM Mar 30 '21 at 10:52
  • Could it be that in GCC the binary is built in what is equivalent of Debug configuration in Visual Studio, instead of in Release? – Dialecticus Mar 30 '21 at 10:54
  • @RichardCritten I've added my compiler flags. I can also try to perform the same test running Linux on the 10700K, but think a 20x factor cannot be due to different CPUs. – Giovanni Cerretani Mar 30 '21 at 10:54
  • The i7 is not 20x faster on a single/dual-thread basis. The benchmark may not be optimal, but an order of magnitude difference in timing is a valid question. – OutOfBound Mar 30 '21 at 10:54
  • Can you join the producer thread before starting the clock? This would reduce the amount of randomness in the measurement by a great deal. It could help debugging the problem. – OutOfBound Mar 30 '21 at 10:57
  • Edited adding test on Linux with 10700K. Still 8x slower. – Giovanni Cerretani Mar 30 '21 at 11:08
  • @OutOfBound No he can't join the producer thread before starting the clock. If he does so, the producer thread would wait indefinitely. – Fareanor Mar 30 '21 at 11:14
  • You should probably share the asm output. – Hatted Rooster Mar 30 '21 at 11:24
  • @HattedRooster You may find the ASM output in the updated godbolt link. – Giovanni Cerretani Mar 30 '21 at 12:09
  • gcc exists on Windows. It would be interesting to test gcc on Windows vs. VS19... – Serge Ballesta Mar 30 '21 at 12:21
  • @SergeBallesta I did a GCC run on Windows with WSL and an MSVC run. Surely WSL has overhead, but MSVC was still 18x faster. That's not normal even with the overhead. – Hatted Rooster Mar 30 '21 at 12:29
  • In the 90's, GCC had the reputation of producing much slower executables than *branded* compilers (HP on HP workstations). That was 30 years ago, though... – Serge Ballesta Mar 30 '21 at 12:34
  • It could be that MSVC produces such terrible assembly that it actually HELPS by spacing reads and writes, decreasing contention on the lock. Ironically, slower assembly could produce a faster app in this very specific case. – David Haim Mar 30 '21 at 12:45
  • Most likely caused by `mutex` using user-space polling on Windows (as it is based on `SRWLOCK`) and Linux `mutex` not using any polling, going directly to the kernel. Anyway your test is synthetic and this is not how you measure performance of multi-threaded programs. –  Mar 30 '21 at 14:00
  • @StaceyGirl I'm not measuring performance; I'm just stating that this mutex-related demo is very slow on Linux, and this is a bottleneck for the application I'm developing. – Giovanni Cerretani Mar 30 '21 at 15:20
  • @GiovanniCerretani How do you know it is a bottleneck? A Linux mutex in most libraries might be slower for short sections with high contention, but at the same time it is more fair. If you are having performance issues with it, your problem is likely high contention on the mutex, not the mutex itself. Also, on MacOS the mutex is completely fair, so performance (mutex throughput) will go down 10 more times there. –  Mar 30 '21 at 15:23
  • @StaceyGirl: I checked the disassembly of glibc pthreads `__pthread_mutex_lock` in libpthread.so. There's a `pause` instruction in there (in a loop with `lock cmpxchg`), so there's some user-space spin code in there. IDK if there's a config variable that it checks before using it, though; I didn't try single-stepping into it to see if that spin loop is reached in code built normally, and run without any special env vars. Of course, that's just for taking a lock; condition vars are another level of complication. – Peter Cordes Mar 30 '21 at 20:54
  • @StaceyGirl: Update, I did single-step into it, and yeah it looks like it tries to take the lock once, but doesn't spin at all after that, just calling `__lll_lock_wait` which makes a system call. Also interesting that glibc mutex_lock apparently reads from the lock before trying to `lock cmpxchg` - [I thought that was worse](https://stackoverflow.com/questions/63008857/does-cmpxchg-write-destination-cache-line-on-failure-if-not-is-it-better-than) because it might cost a share-request for the read *and then* an RFO (Read For Ownership) for the `lock cmpxchg`. – Peter Cordes Mar 30 '21 at 21:09
  • @PeterCordes It reads the mutex type. Since `pthread_mutex_t` can be many things, there is a switch-case in every `pthread_mutex_` function. So the `pause` you saw is code for the "adaptive" variant, `PTHREAD_MUTEX_ADAPTIVE_NP`. –  Mar 31 '21 at 06:52

3 Answers


As Mike Robinson answered, this likely has to do with the different locking implementations on Windows and Linux. We can get a quick idea of the overhead of each implementation by profiling how often it switches contexts. I can do the Linux profile; I'm curious whether anyone else can try to profile the same on Windows.


I'm running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.

I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong, and I recorded the context switches with this command:

sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong

I then extracted a report from the perf counts with this command:

sudo perf report -n --header --stdio > linux_ping_pong_report.sched

The report is large, but I'm only interested in this section, which shows that about 200,000 context switches were recorded:

# Total Lost Samples: 0
#
# Samples: 198K of event 'sched:sched_switch'
# Event count (approx.): 198860
#

I think that indicates really bad performance: the test pushes and pops n = 100000 items through the double buffer, so there is a context switch almost every time we call end_reading() or end_writing(), which is what I'd expect from using std::condition_variable.
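If nearly every hand-off pays for a context switch, one common mitigation, sketched here purely for illustration (the helper name and the spin count are my assumptions, not something from the question's code), is to spin briefly in user space before blocking on the condition variable, similar in spirit to an adaptive mutex:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch: poll the predicate a bounded number of times before blocking.
// The spin count (4096) is an arbitrary assumption and should be tuned
// against the real workload.
template <typename Pred>
void spin_then_wait(std::mutex& mtx, std::condition_variable& cv, Pred pred) {
    for (int i = 0; i < 4096; ++i) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            if (pred()) return;  // fast path: no kernel transition needed
        }
        std::this_thread::yield();  // let the other thread make progress
    }
    // slow path: block in the kernel until notified
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, pred);
}
```

Note the helper returns with the mutex released, which matches how get_buffer_read() is used in the question; whether the spin actually pays off depends on how quickly the other side flips the flag.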

Gandhi

A gap as great as this one probably has to do with the respective locking implementations. A profiler should be able to break down the reasons why the process is being forced to wait. The lock semantics and features are not at all the same between these two OSes.

Mike Robinson

To complement @Gandhi's answer for Linux, I ran the same measurements on Windows 10.

I used Pix to generate a sampling profile of the application running in x64 Release, built with MSVC (VS 2019). For this sampling profile I filtered the view down to the two threads contending on the lock, shown as:

[screenshot: Pix sampling profile filtered to the two contending threads]


Then I exported this as a .wpix file. Since .wpix uses SQLite as storage, I opened the exported file with an SQLite browser and queried the ReadyThread table, which contains one row per context switch (based on "readying the thread").

I then ran:

SELECT COUNT(*) FROM ReadyThread

This gave me 27332, meaning the Windows run of this snippet performed roughly 27 thousand context switches. Compared to the ~200 thousand observed under Linux in Gandhi's answer, this is most likely the culprit for the timing difference you're seeing.

I ran this under Windows 10 Pro Version 20H2 19042.867 on an AMD Ryzen 9 5950X 16-Core @ 3.40 GHz.

Hatted Rooster
  • This is really interesting! 27k/200k is about 1/8, which just about explains the 8x speedup. I've been looking at the `std::condition_variable` documentation for a day, and I haven't been able to find where or why the Windows version can get away with so many fewer context switches. – Gandhi Apr 02 '21 at 13:23
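The answers above converge on context switches from the mutex/condition-variable hand-off as the cost. Since the question has exactly one producer and one consumer, a lock-free variant can avoid kernel transitions entirely. This is a sketch for illustration only (the class name, the yield-based spin, and the memory orderings are my assumptions, not code from the question or the answers):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Sketch of a single-producer/single-consumer ping-pong buffer that
// replaces the mutex + condition_variable pair with one atomic flag.
template <typename T>
class spsc_ping_pong {
public:
    explicit spsc_ping_pong(std::size_t size) : _a(size), _b(size) {}

    // Producer side: fill the write buffer, then publish it.
    T* begin_write() { return _write->data(); }
    void end_write() {
        // Wait until the consumer has released the previous buffer.
        while (_ready.load(std::memory_order_acquire))
            std::this_thread::yield();  // keeps the spin friendly when oversubscribed
        std::swap(_read, _write);
        _ready.store(true, std::memory_order_release);  // publish the new data
    }

    // Consumer side: wait for data, read it, then release the buffer.
    const T* begin_read() {
        while (!_ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        return _read->data();
    }
    void end_read() { _ready.store(false, std::memory_order_release); }

private:
    std::vector<T> _a, _b;
    std::vector<T>* _read = &_a;
    std::vector<T>* _write = &_b;
    std::atomic<bool> _ready{false};
};
```

The acquire/release pair on `_ready` is what makes the buffer contents visible across threads. Whether this beats the blocking version depends on how long each side would otherwise wait: spinning burns CPU where a condition variable would sleep.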