I've written this implementation of a double buffer:
// ping_pong_buffer.hpp
#pragma once
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class ping_pong_buffer {
public:
    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    // Parentheses, not braces: std::vector<T>{ size } would select the
    // initializer_list constructor when T is an integer type.
    explicit ping_pong_buffer(std::size_t size)
        : _read_buffer(size)
        , _read_valid{ false }
        , _write_buffer(size)
        , _write_valid{ false } {}

    // Blocks until the writer has published a buffer.
    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    // Marks the read buffer as consumed and wakes the writer.
    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    // Only the writer thread touches _write_valid outside the lock.
    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    // Blocks until the reader is done, then swaps the buffers and wakes the reader.
    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:
    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;
};
With this dummy test, which does nothing but swap the buffers, performance is about 20 times worse on Linux than on Windows:
#include <thread>
#include <iostream>
#include <chrono>
#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {
    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();
    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }
    const auto t_end = std::chrono::steady_clock::now();

    producer.join();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';
    return 0;
}
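One way to check whether the gap comes from the synchronization primitives rather than from the buffer class is to time bare mutex/condition-variable handoffs between two threads. The following is only a minimal sketch under that assumption: `measure_handoff_ns` is a hypothetical helper that is not part of the code above, and it bounces a single flag back and forth instead of swapping buffers.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical helper (not part of the original code): bounces a flag between
// the calling thread and a worker `rounds` times through one mutex and one
// condition variable, and returns the mean time per round trip in nanoseconds.
double measure_handoff_ns(std::size_t rounds) {
    assert(rounds > 0);

    std::mutex mtx;
    std::condition_variable cv;
    bool ping = false;  // true: worker's turn, false: caller's turn

    std::thread worker([&] {
        for (std::size_t i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [&] { return ping; });
            ping = false;
            lk.unlock();
            cv.notify_one();  // hand control back to the caller
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < rounds; ++i) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            ping = true;
        }
        cv.notify_one();  // wake the worker
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !ping; });  // wait for the worker's reply
    }
    const auto t_end = std::chrono::steady_clock::now();
    worker.join();

    const std::chrono::duration<double, std::nano> elapsed = t_end - t_begin;
    return elapsed.count() / static_cast<double>(rounds);
}
```

Calling `measure_handoff_ns(100000)` on both systems, with the same compiler flags as the test, would show whether the bare handoff cost already differs by a similar factor. If it does, the buffer swaps themselves are probably not the source of the difference.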
The test environments are:
- Linux (Debian Stretch): Intel Xeon E5-2650 v4, GCC: 900 to 1000 ms
  - GCC flags: -O3 -pthread
- Windows (10): Intel i7 10700K, VS2019: 45 to 55 ms
  - VS2019 flags: /O2
The code is also on godbolt, with ASM output for both GCC and VS2019 built with the compiler flags actually used.
The same large gap shows up on other machines as well, so it seems to be due to the OS.
What could be the reason for this surprising difference?
UPDATE:
I also ran the test on Linux on the same 10700K, and it is still a factor of 8 slower than on Windows.
- Linux (Ubuntu 18.04.5): Intel i7 10700K, GCC: 290 to 300 ms
  - GCC flags: -O3 -pthread
If the number of iterations is increased by a factor of 10, I get 2900 ms, so the run time scales linearly with the iteration count.
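To put these timings on a per-iteration basis, the totals can be divided by the iteration count. A small sketch of that arithmetic, where `us_per_iteration` is a hypothetical helper introduced only for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: converts a total run time in milliseconds into the
// mean cost of one loop iteration in microseconds.
constexpr double us_per_iteration(double total_ms, std::size_t iterations) {
    return total_ms * 1000.0 / static_cast<double>(iterations);
}

// Figures taken from the measurements reported above.
constexpr double linux_100k  = us_per_iteration(290.0, 100000);    // about 2.9 us
constexpr double linux_1m    = us_per_iteration(2900.0, 1000000);  // about 2.9 us
constexpr double windows_min = us_per_iteration(45.0, 100000);     // about 0.45 us
constexpr double windows_max = us_per_iteration(55.0, 100000);     // about 0.55 us
```

On the 10700K this gives roughly 2.9 µs per iteration on Linux for both iteration counts, against roughly 0.45 to 0.55 µs on Windows, so the difference is a fixed cost per buffer handoff rather than a startup effect.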