4

On x86_64, the Intel documentation (Vol. 3A, section 8.2.3.2) says:

The Intel-64 memory-ordering model allows neither loads nor stores to be reordered with the same kind of operation. That is, it ensures that loads are seen in program order and that stores are seen in program order.

I need to be sure that the two writes to a memory address won't be reordered.

I want to avoid an atomic xchg because of its high cost. And in my application, the other CPU reading that value knows how to deal with an incomplete state.

Some code:

cli();
compiler_unoptimization(); // asm volatile("":::"memory")
volatile uint *p = (uint *)0x86648664; // address doesn't matter
*p = 1;
... // some code here
*p = 0;
sti();

So, am I right assuming that:

  • 1) the CPU won't make *p = 0 visible before *p = 1, without the need for an sfence

  • 2) the compiler (gcc or clang) won't reorder the writes to p either, thanks to the asm trick (which is needed here, right?).

Kroma
  • 1,109
  • 9
  • 18

1 Answer

3

While the C standard guarantees that accesses to volatile objects are issued in order relative to each other, it does not guarantee their order relative to accesses to non-volatile objects.

Both accesses here are volatile, so the compiler has to generate them in order, but anything in the ellipsis can be moved around freely, unless those accesses are volatile, too!

Also, as far as the C standard is concerned, volatile does not imply that the hardware will execute the accesses in order. That would be guaranteed by an appropriate barrier for the CPU, but, depending on the architecture and the barrier, it may not suffice for the rest of the hardware (caches, buses, memory system, etc.).

For x86, ordering is guaranteed (this is not typical, though: many RISCs, e.g. ARM and PPC, are more relaxed and thus require more carefully written code). As you only refer to a single CPU and volatile has no side effects outside it here, the memory system is not relevant. So you are on the safe side here.

Things are much more complicated for memory-mapped peripherals and multiprocessors, i.e. if you have side effects beyond the single CPU. A simple example: the first write may not go past the CPU cache, so anything reading the same memory may see only the second write, or none at all. volatile will not be enough here; you need atomic accesses and (possibly) barriers.

For your code, you can either make all variables in the ellipsis volatile (inefficient), or add compiler barriers around them (one after *p = 1; and one before *p = 0;). This way the compiler will not move instructions across the barrier.

Finally: volatile does not guarantee atomic accesses. Thus, *p may not be written by a single instruction. (I would not emphasise this too much, as I assume uint is unsigned int, which is normally 32 bits on 32- or 64-bit x86 targets, but it will be an issue for 8- or 16-bit CPUs.) To be on the safe side, use _Atomic types (since C11).

PS: Avoid types like uint. The standard type unsigned is not significantly more to type, and everyone instantly knows what you mean. If you need a specific width, use the stdint.h types. Here, you should even use _Bool/bool, as you seem to have just a single true/false flag.

Note that all those features are available for low-level code, too. Especially _Atomic (see stdatomic.h, too) is meant for exactly such purposes and normally does not need any special libraries. Its usage is often no more complicated than with unqualified types if those can be stored atomically anyway (there are also macros which signal whether a specific type is lock-free).

too honest for this site
  • 12,050
  • 4
  • 30
  • 52
  • Not only did I +1 your answer, but I made it the accepted answer to this question. – Kroma Nov 20 '15 at 17:23
  • BTW, all the section which are communicating with IO /other cpus are adjusted with PAT to uncached. I think it solves some issues there. Don't you think? – Kroma Nov 20 '15 at 17:33
  • Well, you should make it strictly ordered, too. However, if that is an OS for x86, restricting it to a single CPU might be a bad idea. It is much harder to add multi-CPU support later than from the beginning, although the start will be harder, of course. Note that Linux started as some practice with protected mode ;-) – too honest for this site Nov 20 '15 at 17:41
  • :) Yes you are right, I will use atomics when I will have to deal with the "exterior", and simple reorderings otherwise. Thanks Olaf. – Kroma Nov 20 '15 at 17:43
  • Please have a look at them asap. You might find them more useful than you thought. At least you do not have to fiddle with CPU-intrinsics. – too honest for this site Nov 20 '15 at 17:48
  • @Kroma: using uncached memory for shared variables is a *terrible* idea. x86 *is* cache-coherent. Reads on one core will see the stores from another core in the same order the other core did them (in program order). In HW, this is implemented by Intel with an *inclusive* L3 cache, so checking the L3 tags tells a core if any *other* core has a modified copy. See http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ to start getting a handle on memory ordering and http://stackoverflow.com/questions/32705169/does-the-intel-memory-model-make-sfence-and-lfence-redundant – Peter Cordes Nov 20 '15 at 22:20
  • @PeterCordes: I agree about not using the cache (x86 without caching would likely be slower than some embedded CPUs) (I did not pay attention to this part of OP's comment). However, I'd think twice about relying on a specific behaviour if atomics would be the more universal (and portable) approach. But that would certainly be subject of a whole book. – too honest for this site Nov 20 '15 at 22:31
  • @Olaf: yes, absolutely the OP should code in terms of source barriers that turn into barrier instructions on weakly ordered arches (or load-acquire / store-release instructions on ARM64). On strongly-ordered x86, source barriers (other than a full seq_cst barrier) will just stop the compiler from reordering anything you need to stay ordered. – Peter Cordes Nov 20 '15 at 22:38
  • @PeterCordes: If you mean compiler barriers with "source barriers": they will not be sufficient. That's why one should use `stdatomics`. They provide a whole zoo of memory orderings, suitable for every variant. Anyway, all this requires quite some knowledge of how memory accesses, concurrency, etc. work. But for an OS you cannot get along without this anyway. – too honest for this site Nov 20 '15 at 22:42
  • @PeterCordes: Just curious: Does ARMv8 not provide LDREX/STREX anymore? – too honest for this site Nov 20 '15 at 22:46
  • @Olaf: I meant a C equivalent to C++ std::atomic stuff, not just a compiler barrier. I meant "use proper barriers in your source". Is there a better phrase for that than "source barriers"?. **re:ARM**: I'm not *really* an ARM guy, I mostly just look at gcc output to see what gcc does on weakly-ordered architectures. But for an increment that only has to be atomic (memory_order_relaxed), [gcc 4.8 for ARM64 uses a `ldxr/add/stxr/cbnz` loop](https://goo.gl/xMuZFn) for `modify_relaxed()` (from [an answer of mine](http://stackoverflow.com/q/32384901/224132)). It uses ldrex/strex for ARM32. – Peter Cordes Nov 20 '15 at 22:58
  • @PeterCordes: That equivalent is what I already mentioned: `stdatomics` and the `_Atomic` qualifier. Quite the same as the C++11 version (it was actually designed to be). I do not know the term "source barrier", just "compiler barrier", but that might be the same. Mostly obsolete (but sometimes easier) with `stdatomics` using the proper memory model. In gcc you can use `asm volatile("" ::: "memory")`. Re ARMv8: Hmm, the name is definitely different. Well, I'll have a closer look at ARMv8 once I have some free time - or I'll work with it ;-) – too honest for this site Nov 20 '15 at 23:09
  • @PeterCordes: ok, I definitely won't mess with the PAT and use the stdatomic header. Thanks :) – Kroma Nov 21 '15 at 07:39
  • "_unless these are volatile_" which is why volatile is utterly inadequate even for low level program (like drivers, MMIO); volatile suppresses all optimization at each sequence point but we only want barriers: a way to tell the compiler to stop optimizing at a point. Then things like MMIO can use normal pointers use, bracketed by barriers. – curiousguy Dec 03 '19 at 01:41
  • @curiousguy That's fundamentally wrong. If you omit `volatile` for MMIO, even with barriers (which exactly, btw?), the compiler is still free to eliminate the access, which is hardly what you want. E.g. consider something like `barrier(); (void)MMIO_SR; barrier();`, which would most likely result in no access at all with a modern compiler. So, no, you still need `volatile` in C and that's exactly what it is meant for. In general, `_Atomic` is problematic here, because - depending on the hardware, ARM e.g. - it can result in 1..N accesses if interrupts hit in between. – too honest for this site Mar 06 '21 at 18:12
  • @curiousguy: To elaborate a bit: barriers do not tell the compiler to stop optimizing. They do not have an effect beyond the barrier macro/function, as they are not compound statements like in some other (less common) languages. They only have an effect right at the sequence point where they occur. Everything before or after them is subject to the rules of the abstract machine. – too honest for this site Mar 06 '21 at 18:19