3

I have considered parallelizing a program so that in the first phase it sorts items into buckets modulo the number of parallel workers, which avoids collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add to reserve a place in the output array, and then std::atomic::compare_exchange_weak to update the current bucket head pointer, so it's lock-free. However, I have doubts about the performance of multiple threads contending for a single atomic (the one we fetch_add on; the number of bucket heads equals the number of threads, so on average there is not much contention on them), so I decided to measure it. Here is the code:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

void BenchmarkAtomic() {
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs.emplace_back(CountingThread);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads, (nThreads * gnAtomicIterations) / nSec,
      (long long)gCounter.load(std::memory_order_acquire));

    thrs.clear();
    gCounter.store(0, std::memory_order_release);
  }
}

int __cdecl main() {
  BenchmarkAtomic();
  return 0;
}

And here is the output:

1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000

The CPU is an 8-core, 16-thread Ryzen 1800X @ 3.9 GHz. The total operations per second across all threads decreases dramatically up to 4 threads, then decreases slowly and fluctuates a bit.

So is this phenomenon common to other CPUs and compilers? Is there any workaround (except resorting to a single thread)?

Serge Rogatch
  • Synchronisation always has an overhead. Amdahl's law tells us that parallelisation won't solve all your problems. If you can run multiple single-thread processes to do the same job, you should probably try that. – Rook Jun 15 '17 at 07:36
  • I don't see any practical benefit (in this case) in increasing the number of threads beyond 4. The reason could be that your CPU spends more and more time context switching/synchronising the threads as the number of threads increases. – Abdus Khazi Jun 15 '17 at 07:37
  • @AbdusSalamKhazi the program isn't doing _anything_ other than switching and synchronising. – Rook Jun 15 '17 at 07:42
  • Note that a single-threaded implementation of this would not use atomics at all, and be *much* faster than the 1-thread version of this. `for (int64_t i = 0; i < iters; i++) { non_atomic_count++; }` could keep the counter in a register for the whole loop, and only store it afterwards. A compiler could even optimize it to `non_atomic_count += iters`. (Actually, compilers [are allowed to do that with your atomic version](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html), but currently don't.) – Peter Cordes Jul 02 '17 at 08:53
  • Anyway, like everyone else has said, you're just measuring that high-contention synchronization overhead goes up with more threads. There has to be some independent work that each thread can do for parallelizing to be worth it at all, however you synchronize. – Peter Cordes Jul 02 '17 at 08:55

3 Answers

4

A lock-free multi-threaded program is not slower than a single-threaded program; what makes it slow is data contention. The example you provided is in fact a highly contended, artificial program. In a real program you will do a lot of work between accesses to shared data, so there will be far fewer cache invalidations and so on. This CppCon talk by Jeff Preshing can explain some of your questions better than I can.

Addendum: try modifying CountingThread to sleep once in a while, to pretend you are busy with something other than incrementing the atomic variable gCounter. Then play with the value in the if statement to see how it influences the results of your program.

void CountingThread() {
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    // take a nap every 10000th iteration to simulate work on something
    // unrelated to access to shared resource
    if (i % 10000 == 0) {
      std::chrono::milliseconds timespan(1);
      std::this_thread::sleep_for(timespan);
    }
    gCounter.fetch_add(1, std::memory_order_acq_rel);
  }
}

In general, every call to gCounter.fetch_add marks that data as invalid in the other cores' caches, forcing them to fetch the data from a cache further away from the core. This effect is the major contributor to the performance slowdown in your program.

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

The table above is taken from Approximate cost to access various caches and main memory?
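
As a rough sanity check against the question's numbers: a plateau of about 62 million increments per second in total works out to roughly 16 ns per increment once the updates are serialized on that single cache line, which is in the same ballpark as the shared/modified-line latencies in the table.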

Lock-free doesn't mean you can exchange data between threads without cost. Lock-free means that you don't wait for other threads to unlock a mutex before you can read shared data. In fact, even lock-free programs rely on locking mechanisms at the hardware level (atomic read-modify-write instructions and the cache-coherency protocol) to prevent data corruption.

Just follow a simple rule: access shared data as little as possible to get the most out of multicore programming.
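
To make that rule concrete, here is a minimal sketch (my own illustration, not the asker's code) in which each thread increments its own counter and the counters are combined only once at the end. The 64-byte alignment is an assumption about the cache line size, and C++17 is assumed so that the vector allocates the over-aligned elements correctly:

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// One counter per thread, padded so that two counters never share a cache line.
struct alignas(64) PaddedCounter {
  std::atomic<int64_t> value{0};
};

int main() {
  const uint32_t nThreads = std::thread::hardware_concurrency();
  const int64_t nIterations = 10 * 1000 * 1000;
  std::vector<PaddedCounter> counters(nThreads);

  std::vector<std::thread> thrs;
  for (uint32_t t = 0; t < nThreads; t++) {
    thrs.emplace_back([&counters, t, nIterations] {
      // Each thread touches only its own cache line, so there is no
      // invalidation traffic between cores during the hot loop.
      for (int64_t i = 0; i < nIterations; i++) {
        counters[t].value.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& th : thrs) {
    th.join();
  }

  // Shared data is read only here, once, after the parallel phase.
  int64_t total = 0;
  for (const auto& c : counters) {
    total += c.value.load(std::memory_order_relaxed);
  }
  printf("total=%lld\n", (long long)total);
  return 0;
}

With this layout each increment stays in the core's local L1 cache, so adding threads adds throughput instead of contention.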

Marek Vitek
  • Your first sentence is too broad. Lock-free atomic operations are still slower than the non-atomic operations a single-threaded program can use, e.g. adding entries to a lock-free hash table is more expensive than adding to a simple hash table. A non-shared counter can be in a register, where incrementing it costs [1 cycle of latency instead of ~17 for `lock add` on Ryzen](http://agner.org/optimize). Plus, you're assuming that the lock-free program is optimally designed to have as little synchronization as possible. That takes a lot of work. – Peter Cordes Jul 02 '17 at 08:27
  • Not to mention that the compiler can optimize better when there aren't any reordering barriers that require values to be in memory at certain points. Anyway, for problems where it makes sense to parallelize, a good lock-free implementation on a multi-core CPU is often great. But a bad lock-free implementation can certainly lose, especially if the problem's parallelism is very fine-grained so it requires a lot of synchronization. – Peter Cordes Jul 02 '17 at 08:32
  • Q: Is my Porsche slower than my friend's minivan? A: No, the Porsche is not slower than your friend's minivan; the reason your friend is beating you during the race is that you are stopping every 50 m to check the tire pressure. @PeterCordes are you still insisting that my first sentence is too broad? The same rules that apply to a mutex-synchronized program apply to a lock-free program too; it is just that lock-free has less overhead. If you read the OP's program, you will see that he made a program where there is nothing but access to shared data. Because of that the whole program is slower, not only a single thread. – Marek Vitek Jul 03 '17 at 08:06
  • I just don't like making generalizations without weasel-words. I mean, in this case the OP created a multi-threaded program to increment a counter N times that is far slower than a simple single-threaded implementation. Because he parallelized the wrong way (all threads competing for the shared counter instead of incrementing a local counter and adding it to the shared counter when done). In your analogy, this is like building the porsche wrong, or something. – Peter Cordes Jul 03 '17 at 10:39
  • *Same rules that apply to mutex synchronized program apply to lock-free program too. It is just that lock free has less overhead.* I agree. But I was only comparing **lock-free vs. single-threaded**. If you *are* going to multi-thread, lock-free is usually great if you do it right and if it's possible to achieve correctness without making operations on shared data a lot slower than if you could assume exclusive access. – Peter Cordes Jul 03 '17 at 10:46
2

It depends on the concrete workload.

See Amdahl's law:

                              100 % (whole workload in percentage)
speedup = ---------------------------------------------------------------------------
          (sequential work load in %) + (parallel workload in %) / (count of workers)

The parallel workload in your program is 0 %, so the speedup is 1, i.e. no speedup at all. (You are synchronising to increment the same memory cell, so only one thread can increment the cell at any given time.)
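
For example, even if 90 % of the workload were perfectly parallelizable, 16 workers would give a speedup of at most 100 / (10 + 90 / 16) ≈ 6.4; with 0 % parallel workload the best case is exactly 1, and contention pushes the real result below that.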

A rough explanation of why it performs even worse than a speedup of 1:

With only one thread, the cache line containing gCounter stays in that CPU's cache.

With multiple threads, which are scheduled to different CPUs or cores, the cache line containing gCounter bounces around between the caches of those CPUs or cores.

So the difference is somewhat comparable to incrementing a register with only one thread versus accessing memory for each increment operation. (It is sometimes faster than a memory access, as there are cache-to-cache transfers in modern CPU architectures.)

typetetris
  • That's true, but it explains the performance degradation of this concrete program, which might be helpful for the asker. – typetetris Jun 15 '17 at 07:45
0

Like most very broad "which is faster" questions, the only completely general answer is: it depends.

A good mental model when parallelizing an existing task is that the runtime of the parallel version over N threads will be composed of roughly three contributions:

  1. A still-serial part common to both the serial and parallel algorithms, i.e., work that wasn't parallelized, such as setup or tear-down work, or work that didn't run in parallel because the task was inexactly partitioned1.

  2. A parallel part which was effectively parallelized among the N workers.

  3. An overhead component that represents extra work done in the parallel algorithm that doesn't exist in the serial version. There is almost invariably some small amount of overhead to partition the work, delegate to worker threads and combine the results, but in some cases the overhead can swamp the real work.

So in general you have these three contributions; let's call them T1p, T2p and T3p respectively. The T1p component exists and takes the same time in both the serial and parallel algorithms, so we can ignore it, since it cancels out for the purposes of determining which is slower. What is left is the trade-off between the effectively parallelized work, which shrinks as it is spread over more workers, and the overhead, which does not. In your benchmark essentially all of the time is overhead: every iteration is a contended atomic operation on the same cache line, so adding threads adds synchronization cost rather than useful parallel work, and the multi-threaded version loses to a plain serial counter.

Of course, if you used coarser-grained synchronization, e.g., incrementing a local variable on each thread and only periodically (perhaps only once at the very end) updating the shared variable, the situation would reverse.
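
Applied to the counter benchmark in the question, that coarser-grained approach might look like the following sketch (my own illustration, reusing gCounter and gnAtomicIterations from the question, not code from the original post):

// Coarser-grained variant of CountingThread: count in a register-friendly local
// variable and touch the shared atomic exactly once per thread.
void CountingThreadCoarse() {
  int64_t local = 0;
  for (int64_t i = 0; i < gnAtomicIterations; i++) {
    local++;  // no synchronization, no cache-line ping-pong
  }
  // One contended operation per thread instead of ten million; a compiler may
  // even collapse the loop above into local = gnAtomicIterations.
  gCounter.fetch_add(local, std::memory_order_acq_rel);
}

With that change the shared cache line changes hands only a handful of times in total, so the overhead component stays negligible no matter how many threads you run.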


1 This also includes the case where the workload was well partitioned, but some threads did more work per unit time, which is common on modern CPUs and in modern OSes.

BeeOnRope