
I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.

Here's what the release code looks like:

void avocado_release(struct avocado *p)
{
    if (atomic_dec(&p->refcount) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}

Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?

Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcome. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.

Dietrich Epp

3 Answers


On x86, it will turn into a `lock`-prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the `lock` prefix results in a full memory barrier:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JITs on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, such as in 64-bit mode. (Does lock xchg have the same behavior as mfence?)
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)

On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC sequence can be interrupted (meaning it will fail and retry), and it does not imply an automatic full fence.
None of this compromises the atomicity of the counter in any way; it just means that in the x86 case you happen to get a fence as well, "for free".
On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.

All in all, explicit memory barriers are not necessary for the atomic counter to work properly.

Andras Vass
  • @Rachid K. - thanks for fixing typos, but actual code should generally use code formatting, like the x86 `lock` prefix. (It's code instead of just a name because `lock` is part of the asm syntax for using it.) Italics aren't as appropriate here. (Although italics are less visually intrusive in the middle of a paragraph, so I left it that way in your edit to Bruce's answer. In my own answers, I tend to use all-caps for register names or instruction mnemonics in the middle of a paragraph when I don't want the visual noise of code-formatting many words.) – Peter Cordes Nov 20 '20 at 19:22

It is important to distinguish between atomic accesses (which guarantee that the read/modify/write of the value executes as one atomic unit) vs. memory reordering.

Memory barriers prevent reordering of reads and writes. Reordering is completely orthogonal to atomicity. For instance, on PowerPC, if you implement the most efficient atomic increment possible, it will not prevent reordering. If you want to prevent reordering then you need an lwsync or sync instruction, or some equivalent high-level (C++11) memory barrier.

Claims that there is "no possibility of the compiler reordering things in a problematic way" seem naive as general statements because compiler optimizations can be quite surprising and because CPUs (PowerPC/ARM/Alpha/MIPS in particular) aggressively reorder memory operations.

A coherent cache doesn't save you either. See https://preshing.com/archives/ to see how memory reordering really works.

In this case, however, I believe the answer is that no barriers are required. That is because for this specific case (reference counting) there is no need for a relationship between the reference count and the other values in the object. The one exception is when the reference count hits zero. At that point it is important to ensure that all updates from other threads are visible to the current thread so a read-acquire barrier may be necessary.

Bruce Dawson
  • Also see this paper which I wrote several years ago: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx – Bruce Dawson Feb 20 '15 at 17:24

Are you intending to implement your own atomic_dec or are you just wondering whether a system-supplied function will behave as you want?

As a general rule, system-supplied atomic increment/decrement facilities will apply whatever memory barriers are required to do the right thing. You generally don't have to worry about memory barriers unless you are doing something wacky like implementing your own lock-free data structures or an STM library.

Marcelo Cantos
  • I want to know whether memory barriers are necessary in this case, and why. – Dietrich Epp Apr 08 '10 at 11:03
  • +1 "something" will be required to synchronise access to the refcount field. Whether that "something" is literally a memory barrier, or another similar manipulation of caches, requires trawling through CPU specifications and/or checking the emitted code. It needn't be a full cache flush, perhaps the CPU invalidates just the single cache line that's used. The compiler and CPU each have to ensure instructions aren't re-ordered across the decrement, but the conditional based on the result of the decrement pretty much ensures that anyway. – Steve Jessop Apr 08 '10 at 11:13
  • @Dietrich: in this case, no, because the subsequent operations are conditional on the outcome of the decrement, and there is thus no possibility of the compiler reordering things in a problematic way. Besides, the nature of a refcount is such that, when the count reaches zero, only one thread can possibly have access to the object in question (absent bugs, that is). – Marcelo Cantos Apr 08 '10 at 11:30
  • AFAIK the memory barrier is only for instruction sequencing and should not have to involve the cache at all. – Per Knytt Apr 08 '10 at 11:37
  • @Per: yes, without knowing why the questioner wants to know about memory barriers, it's not clear whether the cache behaviour is relevant, or whether he's just asking about the possibility of micro-code instruction re-ordering. I've never quite figured when you need to prevent instruction re-ordering, but you don't care about cache freshness. But I'm prepared to believe it happens :-) – Steve Jessop Apr 08 '10 at 12:17
  • @Steve: I only mention it because people seem to worry unduly about the cache when discussing multithreading correctness. Modern multiprocessors like the x86 systems will take care of it all in hardware. In a cache-coherent system you only need to worry about cache flushing if you're hacking the kernel or a driver for a device doing DMA transfers. It's important for performance of course, but not for correctness. – Per Knytt Apr 08 '10 at 12:57
  • Sure: do you happen to know whether multicore PowerPC necessarily has coherent cache? But you're right, atomic is atomic, and whether it's implemented with explicit cache invalidation or coherent cache, or whatever, rarely affects application code. There are things you can do assuming coherent cache: whether you should or not is questionable. – Steve Jessop Apr 08 '10 at 13:40
  • @SteveJessop It turns out that you never need to care about instruction reordering, but you do need to care about read/write operations being ordered by the cache subsystem, on some processors (ARM/PowerPC/Alpha/MIPS). Some types of reordering even happen on x86/x64. Also, compilers sometimes aggressively reorder instructions because they are allowed to assume single-threaded execution. See preshing.com or this article for scary details: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx – Bruce Dawson Feb 17 '15 at 22:37