1

I read [this post](https://stackoverflow.com/questions/41051724/why-does-the-compiler-optimize-away-shared-memory-reads-due-to-strncmp-even-if).

The answer interestingly points out:

You do in fact need to modify your code to not use C library functions on volatile buffers. Your options include:

  1. Write your own alternative to the C library function that works with volatile buffers.
  2. Use a proper memory barrier.

I am curious how #2 is possible. Let's say two (single-threaded) processes use `shm_open()` + `memcpy()` to create/open and write to the same shared memory on CentOS 7, with gcc/g++ 7 on x86-64.
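For concreteness, something like this minimal sketch (segment name and size are illustrative; error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Both processes run this and end up with the same bytes mapped.
char *map_segment() {
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    void *p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return static_cast<char *>(p);
}
// writer: memcpy(map_segment(), msg, len);   reader: polls those same bytes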

too honest for this site
HCSF
  • I think your question is being answered in the answer you have linked: [You've gone with the first option; but...might change the buffer).] – Jose Jul 10 '18 at 12:06
  • Hmm...I just reread it again. It is still unclear to me what kind of barrier he is referring to. A hardware barrier (which one: sfence, lfence, mfence, or the implicit one that comes with xchg, lock add, etc.)? A software/compiler barrier? And how to apply it: putting a barrier after writing? Before reading? I think his answer just tells why it works -- The compiler must assume that the barrier check might change the buffer. – HCSF Jul 10 '18 at 12:12
  • What the answerer probably meant was a mutex or semaphore. A memory barrier is usually just an instruction telling the compiler and the CPU not to reorder instructions. – vgru Jul 10 '18 at 12:25
  • Right. That's my understanding about barrier as well. However, I suspect that there might be something more than that (e.g. because of a hardware barrier, the compiler *might be* forced to generate code to read from the memory instead of register). That's why I made this post and hopefully someone can chip in. – HCSF Jul 10 '18 at 12:30
  • @Groo: [memory reordering can happen without instruction reordering](https://stackoverflow.com/questions/50494658/are-loads-and-stores-the-only-instructions-that-gets-reordered/50496379#50496379). A memory barrier tells the CPU to wait for previous operations to become globally visible (to other threads). (At least before allowing later stores to become visible; it can still execute them locally. `mfence` has to stop later loads from even executing (taking their value) until after all stores before the barrier are committed to L1d, though.) TL:DR: the store buffer matters for barriers. – Peter Cordes Jul 10 '18 at 12:48
  • @PeterCordes thanks for adding the CPU reordering part. I missed that part in my reply. Any idea how #2 is possible? Thanks! – HCSF Jul 10 '18 at 12:51
  • @PeterCordes: when I see the phrase "memory barrier", I expect it to refer to an `mfence`, yes. If you simply want to prevent instructions from being reordered by the compiler, you will use a *compiler* barrier, which might be enough if your CPU does not perform any reordering. – vgru Jul 10 '18 at 12:55
  • @Groo: everything I wrote in that comment was about `mfence` in asm, not a compiler-only barrier. It's the order of stores committing to L1d that matters, not out-of-order execution of when they enter the store buffer. If you'd said "... CPU not to reorder memory operations" (instead of "instructions"), I would have agreed with you. I'm just being pedantic here, but instruction reordering is a different thing from memory-operation reordering. Even though the memory operations are caused by instructions. – Peter Cordes Jul 10 '18 at 13:27

3 Answers

3

Roll your own compiler memory barrier, to tell the compiler that all global variables may have been asynchronously modified.

In C++11 and later, the language defines a memory model which specifies that data races on non-atomic variables are undefined behaviour. So although this still works in practice on modern compilers, we should probably only talk about C++03 and earlier. Before C++11, you had to roll your own, or use pthreads library functions or whatever other library.

Related: How does a mutex lock and unlock functions prevents CPU reordering?


In GNU C, `asm("" ::: "memory")` is a compiler memory barrier. On x86, a strongly-ordered architecture, this alone gives you acq_rel semantics, because the only kind of runtime reordering x86 can do is StoreLoad.

The optimizer treats it exactly like a function call to a non-inline function: any memory that anything outside this function could have a pointer to is assumed to be modified. See Understanding volatile asm vs volatile variable. (A GNU C extended asm statement with no outputs is implicitly volatile, so asm volatile("" ::: "memory") is more explicit but equivalent.)

See also http://preshing.com/20120625/memory-ordering-at-compile-time/ for more about compiler barriers. But note that this isn't just blocking reordering, it's blocking optimizations like keeping the value in a register in a loop.

e.g. a spin loop like `while(*ptr_to_shmem) {}` can compile to `if(*ptr_to_shmem) infinite_loop;`, but with a barrier we can prevent that:

void spinwait(int *ptr_to_shmem) {
    while (*ptr_to_shmem) {       // spin on the shared flag
        asm("" ::: "memory");     // compiler barrier: forces a reload each iteration
    }
}

gcc -O3 for x86-64 (on the Godbolt compiler explorer) compiles this to asm that looks like the source, without hoisting the load out of the loop:

# gcc's output
spinwait(int*):
    jmp     .L5           # gcc doesn't check or know that the asm statement is empty
.L3:
#APP
# 3 "/tmp/compiler-explorer-compiler118610-54-z1284x.occil/example.cpp" 1
        #asm comment: barrier here
# 0 "" 2
#NO_APP
.L5:
    mov     eax, DWORD PTR [rdi]
    test    eax, eax
    jne     .L3
    ret

The asm statement is still a volatile asm statement which has to run exactly as many times as the loop body runs in the C abstract machine. GCC jumps over the empty asm statement to reach the condition at the bottom of the loop to make sure the condition is checked before running the (empty) asm statement. I put an asm comment in the asm template to see where it ends up in the compiler-generated asm for the whole function. We could have avoided this by writing a do{}while() loop in the C source. (Why are loops always compiled into "do...while" style (tail jump)?).

Other than that, it's the same as the asm we get from using std::atomic_int or volatile. (See the Godbolt link).
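For reference, a sketch of what that do{}while() source version could look like (the barrier harmlessly runs at least once, even when the flag is already zero):

void spinwait_dowhile(int *ptr_to_shmem) {
    do {
        asm("" ::: "memory");     // barrier runs before each re-check of the condition
    } while (*ptr_to_shmem);      // condition test falls at the bottom, no jump over the asm
}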

Without the barrier, it does hoist the load:

# clang6.0 -O3
spinwait_nobarrier(int*):               # @spinwait_nobarrier(int*)
        cmp     dword ptr [rdi], 0
        je      .LBB1_2

.LBB1_1:                     #infinite loop
        jmp     .LBB1_1

.LBB1_2:                     # jump target for 0 on entry
        ret

Without anything compiler-specific, you could actually use a non-inline function to defeat the optimizer, but you might have to put it in a library to defeat link-time optimization; merely putting it in another source file is not sufficient. So you end up needing a system-specific Makefile or whatever. (And it has runtime overhead.)
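A sketch of that approach (file and function names are illustrative):

// barrier.cpp -- build into a separate library, excluded from LTO,
// so the optimizer can never see that the body is empty.
extern "C" void opaque_barrier(void) {}

// caller.cpp
extern "C" void opaque_barrier(void);

void spinwait_portable(int *ptr_to_shmem) {
    while (*ptr_to_shmem)
        opaque_barrier();         // opaque call forces *ptr_to_shmem to be reloaded
}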

Peter Cordes
  • It is interesting that `asm("" ::: "memory")` does more than just prevent the compiler from optimizing the empty loop away. Learnt a new thing. Thanks. However, what I don't understand is this statement -- "it's blocking optimizations like keeping the value in a register in a loop". So when GCC sees `asm("" ::: "memory")` in a (while/for/do-while) loop, which values (variables) won't be allowed to stay in a register? All the values used in the loop? All the values used in the function containing the loop? Or does it go even further, up the whole function call chain? – HCSF Jul 10 '18 at 14:01
  • @HCSF: any variables whose address has escaped the function, just like for a non-inline function call. I explained all this in my answer to [Understanding volatile asm vs volatile variable](https://stackoverflow.com/a/50937064), which is why I linked it. – Peter Cordes Jul 10 '18 at 14:04
  • I also read your [answer to another post](https://stackoverflow.com/questions/38884893/run-time-overhead-of-compiler-barrier-in-gcc-for-x86-processors). You mentioned `Every memory location which another thread might have a pointer to needs to be up to date before the barrier, and reloaded after. So any such values that are live in registers needed to be stored (if dirty), or just "forgotten about" if the value in a register is just a copy of what's still in memory.` How can the compiler tell which less obvious pointers can be accessed by another thread (e.g. the pointer to shared memory here)? – HCSF Jul 10 '18 at 14:05
  • @HCSF: Everything needs to be stored/reloaded, unless the compiler can *prove* that nothing else can have a reference to it. This is called escape analysis. See also discussion in comments on [How does a mutex lock and unlock functions prevents CPU reordering?](https://stackoverflow.com/posts/comments/88953397), where the same question came up. – Peter Cordes Jul 10 '18 at 14:06
  • Compilers need to conform to the (C/C++) standard, so they can only prove things based on the standard. But now we have shared memory, which the standard doesn't talk much about, so compilers are free to deal with it in any way so long as the semantics of the standard aren't broken. In that sense, the compilers basically can't prove that any variable has no reference from other threads or processes, right? – HCSF Jul 10 '18 at 14:11
  • Now, I see how gcc determines which variables have to be read from or written to memory when it sees `asm("" ::: "memory")`. Thanks. Now, going back to my original question: it seems like I don't really need the `asm("" ::: "memory")` barrier either, because I noticed (please correct me if I am wrong) that when dereferencing a pointer (to the shared memory), GCC translates it to an instruction that accesses memory instead of a register. If that's correct, for a simple producer-consumer system (1 producer + 1 consumer on 1 shared memory segment), no barrier is actually needed on x86_64 because... – HCSF Jul 10 '18 at 14:59
  • ...x86 enforces that only StoreLoad reordering is possible. Then if my writer writes ptr_to_shmem[0] last and writes to the rest first, the reader just needs to keep reading ptr_to_shmem[0]; without volatile and without the barrier, it will still work, no? – HCSF Jul 10 '18 at 15:01
  • Though I am sure that my guess/observation is wrong, because [the post I referred to in the original post](https://stackoverflow.com/questions/41051724/why-does-the-compiler-optimize-away-shared-memory-reads-due-to-strncmp-even-if) has an issue when calling `strncmp((char *) mem, "exit", 4)`: it seems to be optimized away. I thought `strncmp()` would have to dereference `mem` anyway, and so `strncmp()` shouldn't be optimized out and reading from memory would be required. No? – HCSF Jul 10 '18 at 15:08
  • @HCSF: ordering isn't the problem, it's the assumption that re-reading the same location will give the same result (i.e. that no other threads can asynchronously modify the location, because of data-race UB.) The `"memory"` clobber is giving you the effect of `volatile` or a relaxed-atomic. – Peter Cordes Jul 10 '18 at 15:49
  • 2
    I think I finally got your point -- the barrier `asm("" ::: "memory")` can be added to the `while` loop in [the post I referred to in the original post](https://stackoverflow.com/questions/41051724/why-does-the-compiler-optimize-away-shared-memory-reads-due-to-strncmp-even-if) because when GCC sees that barrier, it is treated like a call to a non-inline function that can potentially change any variable it can access, and so the compiler can't just call `strncmp()` once and be done with the loop. – HCSF Jul 10 '18 at 17:04
  • And I did another [experiment](https://godbolt.org/g/Ptfzgo) using a while loop to call a function containing the barrier. But it seems like GCC 8.1 still optimizes it away even though the function contains a barrier. The [doc](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile) doesn't seem to specify how deep down the compiler will check whether it can optimize out the loop. – HCSF Jul 10 '18 at 17:04
  • @HCSF: your functions have undefined behaviour if they fall off the end of a non-void function. GCC apparently assumes the loop must be infinite so the UB isn't reached. Adding `return false` or changing the return type to `void` gives the code-gen you expect (https://godbolt.org/g/jpSHJf). Godbolt indicates lines with warnings with a green mark: click the `0/4` button at the bottom of the compiler pane to see the 4 lines of warning output. (In my link, I already did that and arranged the compiler-messages pane into the layout.) – Peter Cordes Jul 10 '18 at 21:15
  • I also used `asm("nop # from asm":::"memory")` to see where it goes in the compiler output. (A pure comment like `asm("# asm inlined here":::"memory")` works too, but you have to turn off asm-comment filtering in Godbolt to see it.) – Peter Cordes Jul 10 '18 at 21:17
2

To directly answer your immediate question: Use a standard memory barrier - change the while loop to:

#include <stdatomic.h>   /* for atomic_thread_fence; strncmp needs <string.h> */

while (strncmp((char *) mem, "exit", 4) != 0)
    atomic_thread_fence(memory_order_acquire);

(Note that that is C. You've tagged your question as C++, while the original post that you refer to is C. The equivalent C++ looks very similar, however).
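For reference, a sketch of that C++ spelling, with `mem` being the mapped buffer from the linked post:

#include <atomic>
#include <cstring>

void wait_for_exit(volatile char *mem) {
    // const_cast drops the volatile qualifier for strncmp, as in the linked post's (char *) cast
    while (std::strncmp(const_cast<char *>(mem), "exit", 4) != 0)
        std::atomic_thread_fence(std::memory_order_acquire);
}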

Roughly speaking, memory_order_acquire implies that you want to see changes made by other threads (or in this case, other processes). This seems to be enough, with current compilers in some simple experiments I conducted, but technically might not be sufficient without the presence of atomic operations. A full solution would re-implement the strncmp function using atomic loads.
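A sketch of that idea using GCC's __atomic builtins (available in gcc/g++ 7), since pre-C++20 there is no standard way to do an atomic load from a plain char buffer; the function name is illustrative:

static bool shm_matches(const volatile char *mem, const char *word, int n) {
    for (int i = 0; i < n; i++) {
        // relaxed atomic byte load: re-reads memory on every call,
        // never cached in a register across iterations
        if (__atomic_load_n(&mem[i], __ATOMIC_RELAXED) != word[i])
            return false;
    }
    return true;
}
// usage: while (!shm_matches(mem, "exit", 4)) { /* spin */ }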

Strictly speaking you shouldn't use strncmp and the like on volatile buffers (even with the memory barrier, this is almost certainly provoking undefined behaviour, though I imagine you'll never have a problem with current compilers).

Also there are much better ways to solve the problem described in the post you linked. In particular, for a case like that it makes very little sense to use shared memory in the first place; a simple pipe would be a much better solution.

davmac
  • 1
    `atomic_thread_fence` doesn't stop non-atomic operations from being hoisted out of loops; if there's no `atomic` operation between two fences, they can collapse together. As an implementation detail, `atomic_signal_fence` on gcc does seem to work like `asm("" ::: "memory")` and block reordering / folding of operations even on non-atomic variables, though. (But gcc's `atomic_thread_fence` doesn't.) – Peter Cordes Jul 10 '18 at 13:30
  • TL:DR: std::atomic fences only do anything when combined with reads or writes of `atomic<>` objects, according to the language standard. In some cases in non-inline functions, you will see an effect because the function has to work correctly in a program that *does* do atomic ops before/after the call, but loop optimizations, or repeated access to the same variable, within a function can know there aren't any atomic ops. – Peter Cordes Jul 10 '18 at 13:32
  • @PeterCordes I understand this doesn't prove anything, but at least for GCC and Clang it does solve the problem in the linked post (https://godbolt.org/g/diW362). Still thinking about this. Thanks. – davmac Jul 10 '18 at 13:37
  • Here's a very simple example (https://godbolt.org/g/Jafdri) of a case where `std::atomic_thread_fence(std::memory_order_release);` doesn't stop two stores to the same variable from reordering across it and coalescing into one. IDK if that proves anything, because there's no guarantee an observer could ever see the `var=1` store. (Even if `var` was atomic). It does work for a spin-wait like in my answer, though: https://godbolt.org/g/Pt87da, so I guess my earlier comments are wrong, and it's probably fine in any case that matters. But I'm still not *sure*. – Peter Cordes Jul 10 '18 at 13:49
  • @PeterCordes in your example the second store can move _forward_ past the barrier, and then the stores can be collapsed. However, changing it to `memory_order_acq_rel` doesn't seem to make a difference (`seq_cst` does). I'm not sure either... I suspect you may be correct and the compilers aren't optimising to the extent that they could. – davmac Jul 10 '18 at 13:55
  • 1
    @PeterCordes I've edited the answer to suggest that a full solution requires use of atomic operations. – davmac Jul 10 '18 at 13:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174729/discussion-between-davmac-and-peter-cordes). – davmac Jul 10 '18 at 14:07
  • If you're going to `#include <atomic>`, you might as well use `atomic<int>` instead of trying to fake it with `int` + barriers. I like my answer which suggests a GNU C asm memory clobber, which definitely does work and has the intended side effects. – Peter Cordes Jul 10 '18 at 14:11
  • @davmac: *Roughly speaking, `memory_order_acquire` implies that you want to see changes made by other threads (or in this case, other processes)* - this might give a false idea that the entire write to the buffer will be atomic. The reason why you would actually use a fence here would be if there were two related writes in the other thread that you needed to ensure were ordered, which isn't the case here. At the same time, if a different thread writes `"exAB"` to the buffer, and then `"CDit"` while `strncmp` is reading, `strncmp` could succeed without the buffer ever being equal to `"exit"`. – vgru Jul 11 '18 at 07:46
  • @Groo I don't see how that statement implies that the entire write to the buffer will be atomic. What you say about a need to ensure ordering of two writes in another thread isn't right either; an acquire synchronises with a release in another thread and makes the changes prior to the release made in the other thread visible to this thread (after the acquire). Strictly, you need an atomic operation for this to happen, though my answer already says this. Your example with a third thread is correct but not relevant to this question, I think. – davmac Jul 11 '18 at 11:00
  • I re-read 32.9 Fences in n4713. It only talks about how a release fence interacts with an acquire fence or an atomic operation with an acquire operation. The [example](https://godbolt.org/g/Jafdri) Peter put up above does not trigger GCC to generate any special instruction for the CPU. I suspect that GCC 7.1 was wrong in the sense that it optimized too much and prevented var=1 from being visible to another thread, which might have an acquire fence and try to read var with perfect timing to observe var=1. – HCSF Jul 12 '18 at 09:04
  • 1
    But Peter's answer states "In C++11 and later, the language defines a memory model which specifies that data races on non-atomic variables is undefined behaviour". So I guess GCC 7.1 isn't wrong because var isn't atomic and so it is a UB. – HCSF Jul 12 '18 at 09:12
  • @HCSF in that example I'd say the compiler is definitely allowed to collapse the two stores since it is permitted to move the 2nd store forwards through the barrier. However I'm not sure what you mean about the UB, seeing as that example doesn't contain a data race by itself. I _think_ we all pretty much agree that a fence as I've suggested isn't enough, by the wording of the lang. standard, to guarantee that you'll see the effect of writes performed by the other process, but it seems to be enough to create at least a compiler barrier in practice. – davmac Jul 12 '18 at 10:26
  • @davmac The reason why I think it could be UB is that the fence section talks about things around atomic objects. In Peter's example, there is no atomic object. – HCSF Jul 12 '18 at 12:10
  • @HCSF ok but Peter's example is just to show how the code compiles; race condition is a runtime behaviour which requires two threads. (Sure, that code _could_ be involved in a race condition, but so could just about any code which has any non-atomic variables). – davmac Jul 12 '18 at 13:27
  • 1
    @HCSF oh, I see what you are saying now, I think. Yes, the fact that the variable is written twice without any intervening atomic operation means that there's no way to ensure that the first value would ever be seen by another thread, even with the fence. If another thread tried to read the value of the variable, without any other synchronisation, it would be undefined behaviour, yes. – davmac Jul 12 '18 at 13:38
  • @davmac great, we are on the same page :) sorry for my confusing wording. And it also sounds like, per the standard, using `std::atomic_thread_fence()` without an atomic object is kind of meaningless as far as the semantics/behaviour defined by the standard go (the compiler will indeed do something about it in the machine code, but that is just a side effect, not an effect defined by the standard), right? Just like your answer here indeed works because the compiler does something about `std::atomic_thread_fence()` that is not defined in the standard -- not optimizing out the variable and so the loop? – HCSF Jul 12 '18 at 13:50
  • @HCSF right, as per the discussion with Peter above and the note I added in my answer, the fence by itself doesn't really guarantee anything according to the standard; in practice though it _usually_ seems to act as a compiler barrier, and this makes sense given that a fence (with acquire) used properly (i.e. together with an atomic operation) would naturally impose such a barrier. – davmac Jul 12 '18 at 13:56
-1

You can use a process-shared mutex or semaphore.

NAME

pthread_mutexattr_getpshared, pthread_mutexattr_setpshared - get and set the process-shared attribute

SYNOPSIS

#include <pthread.h>

int pthread_mutexattr_getpshared(const pthread_mutexattr_t *
       restrict attr, int *restrict pshared);
int pthread_mutexattr_setpshared(pthread_mutexattr_t *attr,
       int pshared);

DESCRIPTION

The pthread_mutexattr_getpshared() function shall obtain the value of the process-shared attribute from the attributes object referenced by attr. The pthread_mutexattr_setpshared() function shall set the process-shared attribute in an initialized attributes object referenced by attr.

The process-shared attribute is set to PTHREAD_PROCESS_SHARED to permit a mutex to be operated upon by any thread that has access to the memory where the mutex is allocated, even if the mutex is allocated in memory that is shared by multiple processes. If the process-shared attribute is PTHREAD_PROCESS_PRIVATE, the mutex shall only be operated upon by threads created within the same process as the thread that initialized the mutex; if threads of differing processes attempt to operate on such a mutex, the behavior is undefined. The default value of the attribute shall be PTHREAD_PROCESS_PRIVATE.

See Condition Variable in Shared Memory - is this code POSIX-conformant? for an example of a process-shared mutex.
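A minimal sketch of that setup (error handling omitted; the segment name is illustrative):

#include <pthread.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

pthread_mutex_t *create_shared_mutex() {
    int fd = shm_open("/demo_mutex", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(pthread_mutex_t));
    void *p = mmap(nullptr, sizeof(pthread_mutex_t),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    pthread_mutex_t *m = static_cast<pthread_mutex_t *>(p);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);    // only the creating process should do this
    pthread_mutexattr_destroy(&attr);
    return m;
}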

For a process-shared semaphore,

NAME

sem_init - initialize an unnamed semaphore (REALTIME)

SYNOPSIS

#include <semaphore.h>

int sem_init(sem_t *sem, int pshared, unsigned value);

DESCRIPTION

The sem_init() function shall initialize the unnamed semaphore referred to by sem. The value of the initialized semaphore shall be value. Following a successful call to sem_init(), the semaphore may be used in subsequent calls to sem_wait(), sem_timedwait(), sem_trywait(), sem_post(), and sem_destroy(). This semaphore shall remain usable until the semaphore is destroyed.

If the pshared argument has a non-zero value, then the semaphore is shared between processes; in this case, any process that can access the semaphore sem can use sem for performing sem_wait(), sem_timedwait(), sem_trywait(), sem_post(), and sem_destroy() operations.

See How to share semaphores between processes using shared memory for an example of a process-shared semaphore.
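Similarly, a minimal sketch of an unnamed process-shared semaphore placed in shared memory (error handling omitted; names illustrative):

#include <semaphore.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

sem_t *create_shared_sem(unsigned initial_value) {
    int fd = shm_open("/demo_sem", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(sem_t));
    void *p = mmap(nullptr, sizeof(sem_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    sem_t *s = static_cast<sem_t *>(p);
    sem_init(s, 1 /* pshared: nonzero => shared between processes */, initial_value);
    return s;
}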

Andrew Henle
  • Right. I am interested in the barrier option he mentioned. Let me update my post to better reflect that. Thanks for your answer tho. – HCSF Jul 10 '18 at 12:13
  • Haha, I just excluded semaphore 5 sec before your update. Sorry! – HCSF Jul 10 '18 at 12:15
  • @HCSF Explicit barriers are going to be platform-specific. A Linux `pthread_mutex_lock` implementation can be found at https://github.com/lattera/glibc/blob/master/nptl/pthread_mutex_lock.c There has to be at least one memory barrier in there somewhere, although it didn't jump out at me. – Andrew Henle Jul 10 '18 at 12:22
  • Right. I don't mind the barrier solution to be platform specific and so I explicitly stated x86 and centos 7 in my original post. Thanks for reminding :) – HCSF Jul 10 '18 at 12:26