9

Consider N threads doing some asynchronous tasks, each producing a small result value like a double or int64_t, so about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores.

On one hand, if I just allocate an array of N items, each a double or int64_t, then 8 threads will share a CPU cache line, which seems inefficient.

On the other hand, if I allocate a whole cache line for each double/int64_t, the receiver thread will have to fetch N cache lines, each written by a different CPU core (except for the one it wrote itself).

So is there an efficient solution for this scenario? The CPU is x86-64. A solution in C++ is preferred.
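
For concreteness, here is a minimal sketch of the two layouts in question; N = 16 and the names are placeholders, not part of any measured setup:

```cpp
#include <cstdint>

constexpr int N = 16;  // number of worker threads / CPU cores (illustration only)

// Option 1: results packed into one array -> up to 8 values share a cache line.
std::int64_t packedResults[N];

// Option 2: one full 64-byte line per result -> no sharing between writers,
// but the collector must fetch N separate lines, each owned by another core.
struct alignas(64) PaddedResult {
    std::int64_t value;
    // 56 bytes of implicit padding
};
PaddedResult paddedResults[N];
```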

Clarification 1: thread launch/exit overhead is not significant because a thread pool is used, so the cost is mostly synchronization on a critical section.

Clarification 2: The parallel batches carry a dependency. The master thread can only launch the next batch of parallel computations after it has collected and processed the results of the previous batch, because those results serve as parameters of the next batch.

Serge Rogatch
  • Are you sure that insufficiently cache-friendly result returning is a performance-limiting factor? Not thread sync, not context switches, not the tasks' work itself? – user7860670 Sep 12 '17 at 10:28
  • @VTT, at least it's a substantial factor. If cache contention is a `30.5 cycles` delay, and there are 15 threads to contend with, then the delay is `460 cycles` in total, limiting throughput to 7.83 million such operations per second. This is substantial relative to the CPU's 3.6 billion operations per second. – Serge Rogatch Sep 12 '17 at 10:36
  • It seems you only have 2 options, write an array or write independent variables. I would be tempted to try both and see who wins. – Galik Sep 12 '17 at 10:38
  • I don't quite get it. The work each thread does should obviously avoid any false sharing (reading/writing to cache lines common with other threads), but reporting the result back to the system *once* may well be done to a common cache line. So the critical question here is how often is your *receiver* fetching the data for each time they are written/assigned? – Walter Sep 12 '17 at 10:39
  • @Walter, yes, threads report the result only once after it's calculated. However, I would like to scatter and gather millions of such parallel calculations per second. In other words, the head thread distributes the work, launches worker threads, then collects the results, does something with them, and then launches parallel computations again. – Serge Rogatch Sep 12 '17 at 10:44
  • This is a little bit off topic, but this sounds a bit like a problem for techniques like OpenMP or Cilk Plus. I know that at least the latter has objects called "reducers" which do the accumulation of results. Note that it is very inefficient to launch thousands of short threads compared to fewer but longer-running threads. – Jounathaen Sep 12 '17 at 11:31
  • Have you tried experimenting with the WC memory type (which bypasses the caches but still doesn't require a full roundtrip to memory), NT stores (if you just publish the results once) and ERMS (which doesn't do an RFO but may have an overhead for small buffers)? – Margaret Bloom Sep 12 '17 at 14:15
  • @MargaretBloom, I haven't tried that and I need examples on how specifically to do that. I think bypassing cache completely would result in large delays due to accessing DRAM each time. It should be better (if possible) to just bypass core-specific L1 and L2 caches, but let the result value get to L3 cache. – Serge Rogatch Sep 12 '17 at 15:02
  • @MargaretBloom: What makes you think USWC (aka WC) doesn't require a round trip to memory? It's **uncacheable**, with write-combining for adjacent stores from the same thread (if they use MOVNT). The reader can use MOVNTDQA to read the whole line, but it's definitely a round trip to DRAM. I think the only difference from UC is that a store to UC memory might stall other stores/loads until it completes because it's strongly-ordered. (So maybe worse than `movnt` + `sfence` to WC memory, but I don't know.) – Peter Cordes Sep 12 '17 at 22:19
  • IDK how the memory controller would handle multiple 8B-writes to the same line. It might possibly merge them in the memory controller instead of doing separate interrupted-bursts to DRAM. – Peter Cordes Sep 12 '17 at 22:22
  • @PeterCordes, yeah, bad wording on my side. I was thinking of the merging thing. – Margaret Bloom Sep 13 '17 at 06:02
  • Is it possible to use a sentinel value (like NaN or a not-yet-available error code), or are all 2**64 possible 8-byte messages equally valid? – Iwillnotexist Idonotexist Sep 13 '17 at 09:28
  • @IwillnotexistIdonotexist, that's often possible, if not always. – Serge Rogatch Sep 13 '17 at 09:32
  • The scenario is quite unusual since you have batches of work that are large enough that parallelizing them across 8 threads is worthwhile, yet small enough that the few dozen cycles of synchronization cost for the results are meaningful. As the size of the result scales up, the problem goes away (every thread can write to its own cache line), and as the number of threads scales up it seems like this cost will become relatively smaller and smaller (compared to the linear costs of waking all the threads, etc). Even in a thread pool, waking the threads costs something. – BeeOnRope Sep 13 '17 at 21:17
  • In any case, I suspect that both approaches are fairly similar: each worker needs to get a line in the exclusive state (it might be the same line or different) and the result reader will need to either get one line (scenario 1) or 8 lines (scenario 2). So the reader cost is slightly lower in S1, but the writers may suffer additional contention if they all finish around the same time. To give a _real_ answer you'd probably have to explain how the threads are pooled, woken and how "work is done" is communicated back to the reader, because it's all relevant at this scale. – BeeOnRope Sep 13 '17 at 21:20
  • For example, waking the threads, and subsequently informing the consumer that all workers are done, generally also requires some synchronization. Perhaps the communication of the single result value per worker can simply be piggy-backed on that mechanism! – BeeOnRope Sep 13 '17 at 21:21

2 Answers

3

Update: I may have misunderstood. Are you looking for fast turnarounds on many tiny batches of work? In that case, you're probably better off with each thread writing to its own cache line, or maybe grouping them in pairs. If each worker thread has to get exclusive access (MESI/MESIF/MOESI) to write into the same cache line, that will serialize all the cores into some order.

Having the reader thread read the results from N threads lets all those cache misses happen in parallel.


From your comment:

I would like to scatter and gather millions of such parallel calculations per second. In other words, the head thread distributes the work, launches worker threads, then collects the results, does something with them, and then launches parallel computations again.

So you have millions of results to collect per second, but only one worker thread per core. So each worker thread has to produce on the order of 100k results per second.

Give each worker an output array, where it stores consecutive results from different tasks it has finished. The actual arrays might only be 4k entries long or something, with some synchronization to let the writer wrap around and reuse the first half once the reader has started on the second half of that thread's buffer.


When the collector thread reads a result from one of those arrays, it will bring that cache line into its own L2/L1D caches, bringing with it the 7 other results in that same cache line (assuming the usual case where the worker thread has already filled all 8 int64_t slots and won't write that cache line again for this group of tiny tasks).

Or better, collect them in batches aligned to cache lines, so conflict misses don't evict a cache line from the collector's L1D before it gets back to it. (Reduce the chance of this happening by skewing the result arrays with a different offset for each thread, so the collector thread isn't reading N cache lines that are all offset from each other by a multiple of 4kiB or something.)
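
A minimal sketch of those per-worker output buffers with skewed starting offsets; all sizes and names are assumptions chosen for illustration, and the wrap-around synchronization is left out:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine       = 64;                           // x86-64 cache-line size
constexpr std::size_t kPerLine    = kLine / sizeof(std::int64_t); // 8 results per line
constexpr std::size_t kBufResults = 4096;                         // per-worker buffer length
constexpr std::size_t kMaxWorkers = 16;

// One output buffer per worker, starting on a cache-line boundary.
struct alignas(kLine) WorkerBuffer {
    std::int64_t results[kBufResults];
};
WorkerBuffer g_buffers[kMaxWorkers];

// Hypothetical skew: start each worker a different number of lines into its
// buffer, so the collector's N read streams aren't offset from each other by
// an exact multiple of 4 KiB (which would make them alias in its L1D sets).
inline std::size_t startIndex(unsigned tid) {
    return (tid * kPerLine) % kBufResults;
}
```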


If you can use a sentinel value in your output arrays, that's probably ideal. If the collector sees that, it knows it got ahead of the worker and should check other threads. (Or sleep if it got through all output arrays without finding new results).
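
A sketch of how the collector could poll such a buffer, assuming NaN is never a valid result; the entries are std::atomic<double> so the cross-thread accesses stay well-defined (on x86-64 the loads and stores still compile to plain MOVs), and all names are placeholders:

```cpp
#include <atomic>
#include <cmath>
#include <cstddef>
#include <limits>

constexpr std::size_t kBufResults = 4096;       // matches the buffer sketch above

// Per-worker output buffer whose entries start out as quiet NaN (the sentinel).
struct SentinelBuffer {
    std::atomic<double> results[kBufResults];   // filled with NaN at startup
};

// Collector: consume everything written since `pos`, stop at the first sentinel.
template <typename Consume>
std::size_t drain(SentinelBuffer& buf, std::size_t pos, Consume&& consume) {
    while (pos < kBufResults) {
        double v = buf.results[pos].load(std::memory_order_acquire);
        if (std::isnan(v))
            break;                               // worker hasn't written this entry yet
        consume(v);
        ++pos;
    }
    return pos;                                  // resume point for the next poll
}
```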

Otherwise you also need current-output-position shared variables which the workers update (with a release-store) after writing the output array. (Maybe batch these position-counter updates to one per 8 array results. But make sure you do it with a pure atomic store, not a += 8. Since the producer thread is the only one that writes that variable, it would be silly to have the overhead of a lock add.)

This would easily cause false sharing between worker threads if packed into one array, and it also definitely needs to be cacheable (not in UC or WC memory, so a worker thread can rewrite it in-place efficiently). So you definitely want each thread to have its own cache line for these. The collector will just have to suffer the penalty of reading N different cache lines (and probably suffering memory-order mis-speculation machine clears: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
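
A sketch of such a per-worker position counter, padded to its own cache line and published with a pure release store; the names are assumptions, not a tested implementation:

```cpp
#include <atomic>
#include <cstddef>

// One write-position counter per worker, each padded to its own cache line so
// updating them doesn't cause false sharing between workers.
struct alignas(64) WritePos {
    std::atomic<std::size_t> pos{0};
};

// Worker side: after filling results[old .. old+7], publish them with a single
// pure release store; only this thread ever writes `pos`, so no locked RMW is needed.
inline void publishBatch(WritePos& wp, std::size_t newPos) {
    wp.pos.store(newPos, std::memory_order_release);
}

// Collector side: an acquire load tells it how far it may safely read.
inline std::size_t readableUpTo(const WritePos& wp) {
    return wp.pos.load(std::memory_order_acquire);
}
```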


Actually, the best option in that case would probably be to use one of the 8 qwords in every cache line of the output arrays as a "complete" flag or bitmap, so the collector thread can check that to see if the 7 results in a cache line are ready.
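
A sketch of that layout, with one flag qword guarding the other seven results in its line; again, the names and details are assumptions:

```cpp
#include <atomic>
#include <cstdint>

// One 64-byte line: seven results plus one flag qword that is written last.
struct alignas(64) ResultLine {
    std::int64_t               results[7];
    std::atomic<std::uint64_t> ready{0};   // bit i set => results[i] is valid
};

// Worker: write a slot, then publish it by setting its bit with a release store.
// Only this worker writes `ready`, so a plain load-modify-store is enough.
inline void writeResult(ResultLine& line, unsigned slot, std::int64_t value) {
    line.results[slot] = value;
    std::uint64_t mask = line.ready.load(std::memory_order_relaxed) | (1ull << slot);
    line.ready.store(mask, std::memory_order_release);
}

// Collector: see which of the seven results in this line are ready so far.
inline std::uint64_t readyMask(const ResultLine& line) {
    return line.ready.load(std::memory_order_acquire);
}
```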


If just getting the results between worker and collector threads is your main bottleneck, then your threading is probably too fine-grained. You should break your tasks up more coarsely, or have each worker thread combine some of the results it produced while they're still hot in its L1D. That's much better bandwidth than handing them to another core through L3 or DRAM.


Peter Cordes
  • Unfortunately, although there are some applicable points, this answer is largely based on a misunderstanding, as if the batches were independent. In fact, the master thread can only launch the next batch of parallel computations after it has collected and processed the results of the previous batch, because those results serve as parameters of the next batch. – Serge Rogatch Sep 13 '17 at 06:30
  • @SergeRogatch: Instead of going back to a master thread, can you have each worker thread read the results of the other threads, and duplicate the calculations needed to figure out what to do next? Results in duplicate work, but saves a synchronization round trip. – Peter Cordes Sep 13 '17 at 06:32
  • I think no, that changes and complicates the design a lot, and the master thread may need some synchronization (to protect from parallel requests from other users), so I am afraid of moving all this to the worker threads. – Serge Rogatch Sep 13 '17 at 06:47
  • The problem is you haven't really given enough surrounding details for Peter to give a complete answer, have you? A normal distribute -> consume model involves the master thread blocking, which already usually costs thousands of cycles or more, so you have to be doing thousands of cycles of work for it to make sense, and the dozens of cycles of synchronization overhead of one "result returning" technique over the other isn't going to matter. So I assume that you have some faster non-blocking way for the master thread to wait, spinning on some memory location or some such. – BeeOnRope Sep 13 '17 at 21:24
  • Knowing how it all works would help a lot. If there is really no existing system and this is all just theoretical, then I really think it just doesn't matter: you need to get all the other stuff right before this matters, and the details are all very important. Only then can you maybe try to optimize this aspect (but as above, maybe you can pass the result for free on your synchronization mechanism). Even beyond that there are tons of other relevant details: will all the work batches finish at almost exactly the same time? Can the consumer do some work before all results are received? – BeeOnRope Sep 13 '17 at 21:28
1

If the number of accesses/writes the worker threads make far exceeds the number of results reported back to (and read by) the head/master thread, then

  • You must avoid false sharing (using a common cache line) between the workers. This should be done by using automatic variables (which may actually be implemented as register-only) for the internal work.
  • Communicating results back to (or inputs from) the master thread is less critical for efficiency and may use an array (i.e. a common cache line). Here, you can simply experiment with what works best; see the sketch after this list.
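
A minimal sketch of that split, assuming the per-task work is just a sum over some inputs; all names are placeholders:

```cpp
#include <cstddef>
#include <cstdint>

// Each worker accumulates in an automatic variable and touches the shared
// result array exactly once, so the potentially shared cache line is written
// a single time per batch. The master must still synchronize (join / barrier)
// before reading the array.
void worker(unsigned tid, const std::int64_t* input, std::size_t count,
            std::int64_t* sharedResults /* one slot per worker */) {
    std::int64_t local = 0;            // lives in a register or on this thread's stack
    for (std::size_t i = 0; i < count; ++i)
        local += input[i];             // stand-in for the real per-task computation
    sharedResults[tid] = local;        // the only write to the common cache line
}
```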
Walter
  • I think there should be a trick like selecting a different cache strategy (WB/WC/WT etc), so that worker threads can just write their 8 bytes each, without fetching the whole cache line for their own use. Or some kind of non-temporal hint may help, if there is something to let the value reach the L3 cache only, without copying to the per-core L1/L2 caches. So to experiment I need a few more options besides the 2 I described in the question (and I guess those may give the same performance, which is too low for my needs). – Serge Rogatch Sep 12 '17 at 13:37
  • @SergeRogatch: That sounds like a bad idea. Intel's manual suggests not using NT stores to the same line as a mutex; IDK if it's dangerous and can cause non-coherent behaviour or what, but it still sounds like a bad idea. WC and UC are terrible because the writes have to go all the way to DRAM, and WB or WT are both cacheable memory types, so you get false sharing. – Peter Cordes Sep 12 '17 at 22:13
  • My impression was that each worker would only write to the result value _once_ per batch of work, which would then be read _once_ by the master thread. If it's not like that you can make it like that by doing work in a temporary location on each worker and then only updating the shared location once when the work is complete. Any repeated updating of the same location by the workers would just be false sharing for no real purpose. – BeeOnRope Sep 13 '17 at 21:31