10

I'm currently writing C++ code and use a lot of memory barriers / fences in my code. I know that a MB tells the compiler and the hardware not to reorder writes/reads across it. But I don't know how costly this operation is for the processor at runtime.

My question is: What is the runtime overhead of such a barrier? I didn't find any useful answer with Google... Is the overhead negligible? Or does heavy usage of MBs lead to serious performance problems?

Best regards.

Programmer
  • 101
  • 1
  • 3

2 Answers

3

Compared to arithmetic and "normal" instructions I understand these to be very costly, but I don't have numbers to back up that statement. I like jalf's answer describing the effects of the instructions, and would like to add a bit.

There are in general a few different types of barriers, so understanding the differences could be helpful. A barrier like the one jalf mentioned is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64). All reads and writes must be complete, and only instructions later in the pipeline that have no memory access and no dependencies on in-progress memory operations can be executed.
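To make the release side concrete in C++11 terms (my own sketch, not taken from any particular mutex implementation), a minimal spinlock might look like this; the release store on unlock plays the role of the lwsync/st4.rel described above:

#include <atomic>

// Hypothetical minimal spinlock, just to show where the barriers sit.
class SpinLock {
    std::atomic<int> lockWord{0};
public:
    void lock() {
        // acquire: later memory accesses may not be moved above the lock
        while (lockWord.exchange(1, std::memory_order_acquire) != 0) {
            // spin
        }
    }
    void unlock() {
        // release: all earlier reads/writes complete before the lock word
        // is cleared, so they cannot leak out of the critical section
        lockWord.store(0, std::memory_order_release);
    }
};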

Another type of barrier is the sort you'd use in a mutex implementation when acquiring a lock (examples: isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched it must be discarded. Example:

if ( pSharedMem->atomic.bit_is_set() ) // use a bit to flag that somethingElse is "ready"
{
   foo( pSharedMem->somethingElse ) ;
}

Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flag bit is complete.
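In C++11 atomics the same pattern would be expressed roughly like this (my own sketch; foo and the field names are just placeholders): the acquire load on the flag keeps the read of somethingElse from being satisfied by a stale, earlier-prefetched value.

#include <atomic>

void foo(int);  // placeholder for whatever consumes the data

struct Shared {
    std::atomic<bool> ready{false};  // the flag bit
    int somethingElse = 0;
};

void consumer(Shared* pSharedMem) {
    // acquire: the load of somethingElse is ordered after the flag check
    if (pSharedMem->ready.load(std::memory_order_acquire)) {
        foo(pSharedMem->somethingElse);
    }
}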

There is a third type of barrier, generally less used, which is required to enforce store-load ordering. Examples of such instructions are sync on ppc (heavyweight sync), mf on ia64, and membar #StoreLoad on sparc (required even for TSO).

Using ia64-like pseudocode to illustrate, suppose one had

st4.rel
ld4.acq

Without an mf in between, one has no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel are done before that store or the "subsequent" load, but that load or other future loads (and perhaps stores, if non-dependent?) could sneak in, completing earlier, since nothing prevents that otherwise.
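The C++11 equivalent of inserting that mf is a sequentially consistent fence. A classic store-buffering litmus test (my own sketch, not from the ia64 example above) shows what it buys you:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    // full (StoreLoad) fence: the following load may not complete
    // ahead of the preceding store
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // With the fences, r1 == 0 && r2 == 0 is impossible.
    // With only acquire/release ordering, both loads could see 0.
}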

Because mutex implementations very likely only use acquire and release barriers, I'd expect that an observable effect of this is that memory access following lock release may sometimes actually occur while "still in the critical section".

Peeter Joot
  • 7,848
  • 7
  • 48
  • 82
  • Very interesting answer from 10 years ago! I think I found that effect today with ConcurrentQueue and AutoResetEvent. After EnQueue'ing and calling event.Set, VERY rarely (every 12-24 hours) does the other side return from WaitOne( ) only to find TryDequeue( ) returns false. I found if I output the queue.Count property several times I can see it magically change from 0 to 1. – Brain2000 Jan 13 '19 at 19:15
  • @Brain2000 This shouldn't happen. With release semantics on Set, everything that happened before on that thread should have been "committed" before Set. Likewise, with acquire semantics on Wait, all reads following the Wait should have "checked out" fresh data. There's either a bug in your code or the library you're using. – relatively_random Sep 17 '20 at 09:35
  • @relatively_random This is my own code. I switched to BlockingCollection and all the issues disappeared. I do believe there is a rare race condition with ConcurrentQueue or AutoResetEvent. My code is one of the answers here: https://stackoverflow.com/questions/1475747/is-there-an-in-memory-stream-that-blocks-like-a-file-stream – Brain2000 Sep 30 '20 at 18:25
1

Try thinking about what the instruction does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and the number of outstanding reads/writes).

Accessing main memory is generally pretty expensive (10-200 clock cycles), but in a sense, that work would have to be done without the barrier as well, it could just be hidden by executing some other instructions simultaneously so you didn't feel the cost so much.

It also limits the CPU's (and compiler's) ability to reschedule instructions, so there may be an indirect cost as well, in that nearby instructions can't be interleaved in a way that might otherwise yield a more efficient execution schedule.
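If you want an actual number for your machine, a tiny micro-benchmark is easy to write. The sketch below (my own, results vary widely between CPUs and say nothing about real workloads) compares a loop of relaxed atomic stores with the same loop plus a full fence per iteration:

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    constexpr int N = 10000000;
    std::atomic<int> v{0};

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        v.store(i, std::memory_order_relaxed);

    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        v.store(i, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // e.g. mfence on x86
    }
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("plain:  %lld ns\nfenced: %lld ns\n",
                (long long)std::chrono::duration_cast<ns>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
}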

jalf
  • 243,077
  • 51
  • 345
  • 550
  • 6
    This answer is not right. The sentence "all reads and writes have been committed to main memory" is incorrect. Main memory is not impacted by a memory fence. At a minimum, with cache coherence, why go all the way to main memory? And the cycle times for main memory access are WAY off. – Timoteo Nov 13 '14 at 18:34
  • @Timoteo The main memory is affected by a memory fence, but the impact on reads and writes depends on the type of fence used. – grayxu Mar 27 '23 at 15:36