
I have a problem and I need to understand whether there is a better solution. I have written the following code to pass a few variables from a writer thread to a reader thread. The threads are pinned to different CPUs that share the same L2 cache (hyperthreading is disabled).

writer_thread.h

#include <stdint.h>

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;

    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));

volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));

reader_thread.h

uint32_t tail;
struct a_few_vars *p_xxx;

The writer thread increments the head variable, and the reader thread checks whether head and tail are equal. If they are not equal, it reads the new data as follows:

while (true) {
    if (tail != head) {
        .. process xxx[head] ..
        .. update tail ..
    }
}

Performance is by far the most important issue. I'm using Intel Xeon processors, and the reader thread fetches the head value and the xxx[head] data from memory each time. I used the aligned array to stay lock-free.

In my case, is there any method to flush the variables to the reader CPU's cache as soon as possible? Can I trigger a prefetch for the reader CPU from the writer CPU? I can use special Intel instructions via __asm__ if they exist. In conclusion, what is the fastest way to pass the variables in the struct between threads pinned to different CPUs?

Thanks in advance

  • `volatile` is _insufficient_ to prevent a race condition. You'll need a mutex or you'll have to access the variables via primitives from `stdatomic.h` – Craig Estey Dec 23 '18 at 19:42
  • @CraigEstey, that is not true because the writer thread first updates the xxx[head + 1], then updates head, so it is sufficient to prevent a race condition. – avatli Dec 23 '18 at 19:46
  • Are you using an ancient Core 2 Xeon? Actually those didn't have HT, so it would have to be Pentium 4 Xeon for everything in your question to make sense. All newer Intel CPUs (Nehalem and newer) have private per-core L1 and L2, and only the last-level L3 cache is shared. This includes KNL (Xeon Phi). – Peter Cordes Dec 23 '18 at 19:47
  • @PeterCordes, you are right; I wrote "the same L2 cache" to emphasize that there is no NUMA transaction involved. The CPUs are in the same processor. – avatli Dec 23 '18 at 19:58
  • Um, no ... Because you need something to guarantee that the cache on one CPU core is _flushed_ and visible to the other core atomically (e.g. on `x86`, you need an `mfence` variant). So, once again, either a mutex or atomic operation – Craig Estey Dec 23 '18 at 19:58
  • @CraigEstey: It's UB, but no you don't need `mfence`. You just need to stop compile-time reordering. The writer thread is write-only, so all you need is ordered writes. x86 asm does acquire / release semantics for free, and we don't need seq-cst here. (`memory_order_release` compiles without any extra barrier instructions on x86, just blocking compile-time reordering.) `mfence` doesn't make data visible to the other core any faster, it just stalls the current core's later loads until earlier stores are globally visible. Cache coherency doesn't take extra instructions, just ordering. – Peter Cordes Dec 23 '18 at 20:00
  • @PeterCordes Of course, I defer to your [vast] `x86` arch knowledge. IMO, at a minimum, the code should include comments as to why, in the specific case, the more general solution isn't required as this is relying on something that may eventually break on a different model or arch (e.g. `arm`) – Craig Estey Dec 23 '18 at 20:07
  • Guys, please talk more clearly to ordinary people like me. So I should not need a memory barrier (mfence) for the problem, right? – avatli Dec 23 '18 at 20:12
  • @CraigEstey: The Right Way is using C11 `atomic_store_explicit(..., memory_order_release)`. On x86, that compiles to just a `mov` store. On 32-bit ARM, it compiles to a `dmb ish` barrier + a plain store. On AArch64, it compiles to `stlr`, a release store with no separate barrier instruction. You don't want to mess around with targeting the asm memory model from C using `_mm_mfence()` manually or stuff like that, because the whole point of C11 stdatomic is to let you write portable code and have the compiler do that for you. It's nice to know what's efficient, though. – Peter Cordes Dec 23 '18 at 20:16

1 Answer


It's undefined behaviour for one thread to write a volatile variable while another thread reads it, according to C11. volatile accesses are also not ordered with respect to other accesses. You want atomic_store_explicit(&head, new_value, memory_order_release) in the writer and atomic_load_explicit(&head, memory_order_acquire) in the reader to create acq/rel synchronization, and force the compiler to make the stores into your struct visible before the store to head which indicates to the reader that there's new data.

(tail is private to the reader thread, so there's no mechanism for the writer to wait for the reader to have seen new data before writing more. So technically there's a possible race on the struct contents if the writer thread writes again while the reader is still reading. So the struct should also be _Atomic).
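As a minimal sketch of that publish/consume pair (hypothetical function names, and assuming `head` is changed from `volatile` to `_Atomic uint32_t`; this still has the slot-reuse caveat above):

#include <stdatomic.h>
#include <stdint.h>

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;
    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));

extern struct a_few_vars xxx[];   // the shared slots, as in the question
extern _Atomic uint32_t head;     // _Atomic instead of volatile

/* Writer: fill the slot first, then publish the new index.  The release
 * store makes the slot contents visible to any reader that acquire-loads
 * the new head value. */
void publish(uint32_t new_head, const struct a_few_vars *v)
{
    xxx[new_head] = *v;           // plain stores into the slot
    atomic_store_explicit(&head, new_head, memory_order_release);
}

/* Reader: acquire-load the index; if it moved, the slot data written
 * before the matching release store is safe to read (modulo the
 * slot-reuse caveat above). */
uint32_t poll_head(uint32_t tail, struct a_few_vars *out)
{
    uint32_t h = atomic_load_explicit(&head, memory_order_acquire);
    if (h != tail)
        *out = xxx[h];
    return h;
}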


You might want a seqlock, where the writer updates a sequence number and the reader checks it before and after copying out the variables (https://en.wikipedia.org/wiki/Seqlock). This lets you detect and retry in the rare cases where the writer was in the middle of an update when the reader copied the data.

It's pretty good for write-only / read-only situations, especially if you don't need to worry about the reader missing an update.

See my attempt at a SeqLock in C++11: "Implementing 64 bit atomic counter with 32 bit atomics", and also "how to implement a seqlock lock using c++11 atomic library".

And "GCC reordering up across load with `memory_order_seq_cst`. Is this allowed?" shows another example (that one exposed a gcc bug).

Porting these from C++11 std::atomic to C11 stdatomic should be straightforward. Make sure to use atomic_store_explicit, because the default memory ordering for plain atomic_store is memory_order_seq_cst which is slower.
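Ported to C11 stdatomic, a bare-bones single-writer seqlock might look like the following (hypothetical names, modeled on the pattern in the links above; the plain struct copy is formally a data race in ISO C, as noted earlier, which is why a fully correct version would make the payload members relaxed atomics):

#include <stdatomic.h>
#include <stdint.h>

struct a_few_vars { uint32_t x1, x2; uint64_t x3, x4; };

static _Atomic uint32_t seq;            // even = stable, odd = write in progress
static struct a_few_vars shared_data;   // the payload being published

void seqlock_write(const struct a_few_vars *src)
{
    uint32_t s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);  // mark odd
    atomic_thread_fence(memory_order_release);                 // like smp_wmb()
    shared_data = *src;        // formally a data race in ISO C; see above
    atomic_store_explicit(&seq, s + 2, memory_order_release);  // even again
}

void seqlock_read(struct a_few_vars *dst)
{
    uint32_t s0, s1;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        *dst = shared_data;                        // may tear; checked below
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);                // retry if writer was active
}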


There isn't much you can do to make the writer's stores globally visible any sooner. A CPU core already commits stores from its store buffer to its L1d cache as quickly as possible (obeying the restrictions of the x86 memory model, which doesn't allow StoreStore reordering).

On a Xeon, see "When CPU flush value in storebuffer to L1 Cache?" for some info about different Snoop Modes and their effect on inter-core latency within a single socket.

The caches on multiple cores are coherent, using MESI to maintain coherency.

A reader spin-waiting on an atomic variable is probably the best you can do, using _mm_pause() inside the spin loop to avoid a memory-order mis-speculation pipeline clear when exiting the spin-loop.
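For example (x86-specific, since `_mm_pause()` is an x86 intrinsic; hypothetical function name):

#include <immintrin.h>   // _mm_pause() with GCC/Clang/MSVC targeting x86
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t head;

/* Spin until head moves past tail.  The pause hint frees pipeline
 * resources while waiting and avoids the machine clear on loop exit. */
uint32_t wait_for_new_head(uint32_t tail)
{
    uint32_t h;
    while ((h = atomic_load_explicit(&head, memory_order_acquire)) == tail)
        _mm_pause();
    return h;
}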

You also don't want to wake up in the middle of a write and have to retry. You might want to put the seq-lock counter in the same cache line as the data, so those stores can hopefully be merged in the store buffer of the writing core.

  • Personally, I've come to like a ticket lock to prevent starvation and promote fairness: https://en.wikipedia.org/wiki/Ticket_lock All the rest about limiting the spin probably applies. From the page, I _think_ the spin on `now_serving` can be done with a simple fetch (i.e. non-atomic), as only the `fetch_and_inc` needs to be atomic. I've used this in production code for a shipping product. – Craig Estey Dec 23 '18 at 22:38
  • @CraigEstey: it has to be an atomic fetch, e.g. `atomic_load_explicit(&now_serving, memory_order_acquire)`. It's not a *read-modify-write*, but it has to be atomic or else there could be tearing of the value. (And in C you need the compiler to assume that other threads can change the value.) Naturally-aligned pure loads/stores of integers/pointers are atomic for free in asm on most ISAs, but in C you *must* tell the compiler about it or else it's data-race undefined behaviour. e.g. [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881) – Peter Cordes Dec 23 '18 at 23:02
  • That's still a lock where both the reader and writer are modifying the lock variable. That's probably good if you have multiple readers and multiple writers, but one write-only thread publishing to one or more read-only threads is a classic use-case for a seqlock or a lockless queue. (Single-writer / single-reader does simplify a lockless queue significantly). – Peter Cordes Dec 23 '18 at 23:06
  • I've always done the release with atomic, just to be safe [my impl. had multiple readers and writers]. The wiki is a bit [too] general [I only said the release _might_ be done non-atomically because the wiki showed it that way]. To release `now_serving`, the _owner_ already _knows_ the value to store (vs. _incrementing_) if the code is restructured (e.g. `ticketLock_acquire` returns the `my_ticket` value and can _store_ `my_ticket + 1`). If the value tears, it just delays the _next_ acquirer a bit longer (i.e. an atomic flush would be faster) but it would work [eventually] – Craig Estey Dec 24 '18 at 00:01
  • As to seqlock, I've just looked over the latest `include/linux/seqlock.h` and it seems a bit more heavyweight. But, it's not a pure _lock_, since the reader(s) have to _retry_ if they detect that the seqlock value has changed: `do { seqlock_acq ; do_stuff ; } while (seqlock_retry);`. So, `do_stuff` is a bit restricted in what it can do. Suppose `do_stuff` needed the read value of what the lock was protecting, but did something irrevocable to _another_ variable with that value. – Craig Estey Dec 24 '18 at 00:14
  • Hmm. `now_serving` is only modified while the lock is held, so it might actually be ok to make it non-`atomic` in C. But no, `ticketLock_acquire` spins on it changing, so C11 requires that it's an `atomic` variable for a reader to actually notice the change. The Wiki code is described as pseudocode, not a proper C11 implementation. The increment to `now_serving` could be done as a separate atomic load and store, avoiding an atomic RMW, e.g. `*now_serving = 1 + atomic_load_explicit(now_serving, mo_relaxed);`. (And the store part could be `release`, not seq-cst.) – Peter Cordes Dec 24 '18 at 00:15