I am considering parallelizing a program so that in a first phase it sorts items into buckets keyed modulo the number of parallel workers, which avoids collisions in the second phase. Each thread uses std::atomic::fetch_add to reserve a slot in the output array, then std::atomic::compare_exchange_weak to update the current bucket head pointer, so the scheme is lock-free.
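To make the scheme concrete, here is a minimal sketch of the first phase (all names, such as Node, gSlots, gAllocIdx and gBucketHeads, are illustrative, not from my actual program):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kWorkers = 16;   // number of parallel workers == number of buckets
constexpr int32_t kEnd = -1;        // end-of-list marker

struct Node {
    int64_t key;
    int32_t next;                   // index of the next node in the same bucket
};

std::vector<Node> gSlots(1 << 20);          // pre-sized shared output array
std::atomic<int32_t> gAllocIdx{0};          // the single contended atomic
std::atomic<int32_t> gBucketHeads[kWorkers]; // one list head per bucket

void Insert(int64_t key) {
    // Reserve a unique slot in the output array (the contended fetch_add).
    const int32_t slot = gAllocIdx.fetch_add(1, std::memory_order_relaxed);
    gSlots[slot].key = key;
    // Push the slot onto its bucket's intrusive list head with CAS.
    std::atomic<int32_t>& head = gBucketHeads[key % kWorkers];
    int32_t oldHead = head.load(std::memory_order_relaxed);
    do {
        gSlots[slot].next = oldHead;        // link to the previous head
    } while (!head.compare_exchange_weak(oldHead, slot,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

int main() {
    for (auto& h : gBucketHeads) h.store(kEnd, std::memory_order_relaxed);
    for (int64_t key = 0; key < 100; key++) Insert(key);
    printf("allocated %d slots\n", gAllocIdx.load());
    return 0;
}

In the second phase each worker would then drain only its own bucket's list, so the workers never touch the same items.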
However, I had doubts about the performance of multiple threads contending for a single atomic (the one we fetch_add on; there are as many bucket heads as threads, so on average those see little contention), so I decided to measure it. Here is the benchmark code:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

// Each thread hammers the single shared atomic counter.
void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

void BenchmarkAtomic() {
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs.emplace_back(CountingThread);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    // Aggregate increments per second across all threads in this run.
    printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads,
           (nThreads * gnAtomicIterations) / nSec,
           (long long)gCounter.load(std::memory_order_acquire));
    thrs.clear();
    gCounter.store(0, std::memory_order_release);
  }
}

int main() {
  BenchmarkAtomic();
  return 0;
}
And here is the output:
1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000
The CPU is an 8-core, 16-thread Ryzen 1800X @ 3.9 GHz. The aggregate operations per second across all threads drops dramatically up to 4 threads, then declines slowly and fluctuates a bit; per thread, that is a fall from ~151M ops/sec single-threaded to under 4M ops/sec with 16 threads.
Is this phenomenon common to other CPUs and compilers? Is there any workaround (other than resorting to a single thread)?