How to ensure writes are visible to other cores

Question

I have the following situation:

Process 1 (on core 1):

set_nonzero_8byte_posix_shm_memory_to_zero();
run_a_function();

Process 2 (on core 2):

uint64_t v = read_that_8byte_posix_shm_memory();
if (v != 0) {
  // infer the function has not run yet
}

Essentially, I want set_nonzero_8byte_posix_shm_memory_to_zero() to wait until the store is visible to the other cores, so that a read in another process (i.e. process 2) can make the inference described above.

I thought of using an sfence between set_nonzero...to_zero() and run_a_function(), but the Linux memory barriers documentation (https://www.kernel.org/doc/Documentation/memory-barriers.txt) says:

There is no guarantee that any of the memory accesses specified before a memory barrier will be complete by the completion of a memory barrier instruction; the barrier can be considered to draw a line in that CPU's access queue that accesses of the appropriate type may not cross.

Hence, my interpretation is that getting past the sfence (having "completed" the store of zero) and starting run_a_function() does not guarantee that a read in process 2 returns 0, so process 2 could wrongly conclude that run_a_function() has not happened yet.

How can I get the behavior I want? (Would treating that address as volatile cut it, would an atomic store with sequential consistency do it, etc.?)

Information about my environment: I am on a high-core-count NUMA dual-socket machine (x86, 64-bit); however, AFAIK everything is numactl'ed to stay on a particular socket.

Any help would be much appreciated, thank you!

Sounds like you want C11 atomic types. Or a shared mutex to protect access to that shared variable (and function) — Shawn, Jul 04 '22 at 20:07
I think you have to start with a design that isn't inherently racy. It doesn't matter how many barriers Process 1 executes, it is still possible that Process 2 reads a nonzero value (before Process 1 writes 0), but then by the time it gets to the next line of code, both the write of 0 and `run_a_function()` may have begun to execute or even completed. If Process 2 needs to be sure that `run_a_function()` hasn't started running, then *both* processes have to cooperate to ensure this doesn't happen. — Nate Eldredge, Jul 04 '22 at 22:57
The typical design would be to have a mutex or semaphore that Process 1 must hold while it sets the flag and executes `run_a_function()`, and that Process 2 must hold while it tests the flag and performs whatever action `run_a_function()` would interfere with. Notice this means that if one process tries to take the lock while the other holds it, it will actually *block* at the level of the OS. No low-level machine instruction can accomplish that. — Nate Eldredge, Jul 04 '22 at 23:02
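The lock-based design described in the two comments above could look roughly like the sketch below. It is only an illustration under several assumptions: the `shared_state` layout, the `work_done` field, and the `mutex_demo()` driver are hypothetical names, and an anonymous shared mapping plus `fork()` stands in for the two independent processes attached to a POSIX shm segment. The point it shows is that the flag is only meaningful because both sides test and modify it while holding the same process-shared mutex.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout of the shared region: a lock plus the 8-byte flag. */
struct shared_state {
    pthread_mutex_t lock;
    uint64_t flag;       /* nonzero: run_a_function() has not started */
    uint64_t work_done;  /* stand-in for the side effect of run_a_function() */
};

static void run_a_function(struct shared_state *s) { s->work_done = 1; }

/* Process 1: clear the flag and run the function while holding the lock. */
static void writer(struct shared_state *s) {
    pthread_mutex_lock(&s->lock);
    s->flag = 0;
    run_a_function(s);
    pthread_mutex_unlock(&s->lock);
}

/* Process 2: test the flag and act on it under the same lock.
   Returns 1 if it saw a nonzero flag, 0 otherwise. */
static int reader(struct shared_state *s) {
    pthread_mutex_lock(&s->lock);
    int saw_nonzero = (s->flag != 0);
    if (saw_nonzero) {
        /* While we hold the lock the writer cannot be inside
           run_a_function(), so its side effect must be absent. */
        assert(s->work_done == 0);
    }
    pthread_mutex_unlock(&s->lock);
    return saw_nonzero;
}

/* Sets up the region, forks a writer, runs the reader. Returns 0 on success. */
int mutex_demo(void) {
    struct shared_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED) return -1;

    /* The mutex must be marked process-shared to work across processes. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->flag = 1;
    s->work_done = 0;

    pid_t pid = fork();
    if (pid < 0) { munmap(s, sizeof *s); return -1; }
    if (pid == 0) { writer(s); _exit(0); }

    (void)reader(s);  /* may see either flag value; both are consistent */

    int status = 0;
    waitpid(pid, &status, 0);
    int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
             s->flag == 0 && s->work_done == 1;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return ok ? 0 : -1;
}
```

Calling `mutex_demo()` from `main` should return 0. Older glibc versions need `-pthread` when compiling. Whatever Process 2 does inside the `if` has to happen before it releases the lock, otherwise the original race comes back.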
You have your synchronization backwards in the writer. Do a release-store *after* running the function (C11 `atomic_store_explicit(&shared_mem, 1, memory_order_release);`). Then if an acquire load in the reader sees a `1`, it knows the function has *finished* running, not just about-to-start. https://preshing.com/20120913/acquire-and-release-semantics/. — Peter Cordes, Jul 05 '22 at 06:54
If you insist on rolling your own atomic operations with `volatile`, [yes that will work on all real-world systems (but it's not recommended, use lock-free C11 atomics instead)](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118). But you'll need to roll your own memory barriers, too. x86 doesn't need any actual asm instructions, just blocking compile-time reordering. (`sfence` is useless unless you've been using NT stores). Other ISAs like AArch64 do need barriers, or much better, a release-store instruction (`stlr`). — Peter Cordes, Jul 05 '22 at 06:56
Oh, wait a minute, you do want to infer that the function hasn't even *started* yet. That can't work, the `v=0` and execution of the function might become visible to the reader after some test, before the thing it was going to do if safe. (This is what Nate pointed out). If it's not actually a correctness problem to run the `if` body while the other thread is running `run_a_function()`, that's maybe ok. C11 stdatomic functions work on shared memory *if* they're lock-free for that type. (Otherwise they use a table of locks in the current process, which another proc won't respect.) — Peter Cordes, Jul 05 '22 at 07:01
On x86, to make this core block later loads (and stores) until all previous stores are visible to other threads, you want a *full barrier* (https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/), such as a dummy `lock add $0, (%rsp)`. Or better, do the actual store with an `xchg` instead of `mov`. (e.g. `atomic_store` with the default `memory_order_seq_cst`, but that's an x86 implementation detail: other ISAs like AArch64 can do seq_cst that's only as strong as C11 requires, still allowing StoreLoad reordering with later non-SC loads in this thread.) — Peter Cordes, Jul 05 '22 at 07:04
Oh ok, thank you for this information, I understand better now, I will use an atomic store. — Mihir Shah, Jul 06 '22 at 00:46
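For reference, the release/acquire pattern suggested in the comments could be sketched as follows. This is a minimal illustration, not a definitive implementation: `shared_state`, `result`, `try_read()`, and `release_acquire_demo()` are hypothetical names, and an anonymous shared mapping plus `fork()` stands in for two processes sharing a POSIX shm segment. Note that it answers "has the function *finished*?", not "has it not started yet?", which the comments explain cannot be inferred from a flag alone.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared layout: plain payload plus an atomic "done" flag. */
struct shared_state {
    uint64_t result;        /* written by run_a_function() */
    _Atomic uint64_t done;  /* 0 = not finished, 1 = finished */
};

/* C11 atomics only work across processes when they are lock-free;
   otherwise each process would use its own private lock table. */
static_assert(ATOMIC_LLONG_LOCK_FREE == 2, "need lock-free 64-bit atomics");

static void run_a_function(struct shared_state *s) { s->result = 42; }

/* Process 1: do the work first, then publish it with a release store. */
static void writer(struct shared_state *s) {
    run_a_function(s);
    atomic_store_explicit(&s->done, 1, memory_order_release);
}

/* Process 2: an acquire load that observes 1 also observes every write
   made before the release store. Returns 1 and fills *out if finished. */
static int try_read(struct shared_state *s, uint64_t *out) {
    if (atomic_load_explicit(&s->done, memory_order_acquire) == 1) {
        *out = s->result;
        return 1;
    }
    return 0;  /* not finished; says nothing about whether it has started */
}

/* Forks a writer and polls from the parent. Returns 0 on success. */
int release_acquire_demo(void) {
    struct shared_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED) return -1;
    s->result = 0;
    atomic_init(&s->done, 0);

    pid_t pid = fork();
    if (pid < 0) { munmap(s, sizeof *s); return -1; }
    if (pid == 0) { writer(s); _exit(0); }

    uint64_t value = 0;
    while (!try_read(s, &value)) {
        /* spin until the writer publishes; real code would back off */
    }
    waitpid(pid, NULL, 0);
    munmap(s, sizeof *s);
    return value == 42 ? 0 : -1;
}
```

Calling `release_acquire_demo()` from `main` should return 0. If the writer additionally needs its later loads to wait until the store is globally visible, the comments suggest replacing the release store with a plain `atomic_store(&s->done, 1)`, which defaults to `memory_order_seq_cst`.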

0 Answers