
Consider the following program: two threads iterate through the same function, which simply increments the value of a shared counter variable. There's no lock protecting the variable, so we're talking about lock-free programming. We're also ensuring that the threads run on different cores/CPUs. The number of iterations is sufficiently large (e.g. N=100,000).

The operations themselves are listed below as pseudocode. As expected, there will be various delays between the instructions, depending on what else the CPUs are doing. The interleaving below is just one possible way of running them.

      CPU 0           |        CPU 1
------------------------------------------
  LOAD count          |
  INC count           |     LOAD count
                      |     INC count
                      |     STORE count
  STORE count         |
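
To make this concrete, here is a minimal C++ sketch of the program described above; the names (`worker`, `counter`, `N`) and the use of `std::thread` are mine, purely for illustration, and the unsynchronized increment is of course a data race (undefined behaviour in the C++ memory model), which is exactly the point:

    // Two threads increment a plain shared counter with no lock and no atomics.
    #include <cstdio>
    #include <thread>

    static const int N = 100000;
    static int counter = 0;            // shared, not atomic, not protected

    void worker() {
        for (int i = 0; i < N; ++i)
            counter++;                 // compiles to LOAD / INC / STORE, not atomic
    }

    int main() {
        std::thread t0(worker), t1(worker);
        t0.join();
        t1.join();
        // Almost always prints less than 200000, because increments get lost.
        std::printf("counter = %d (expected %d)\n", counter, 2 * N);
    }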

Let's not target only the x86 architecture, where the memory model is pretty strong. In fact, let's consider a "memory-ordering-hostile" architecture (as per C.6.1 of McKenney's book).

The main problem with this code is that the end result will almost certainly be wrong. The race condition will make it happen often enough that one CPU computes the new counter value at the same time as the other one does the same thing, based on the same count value. The result is that each CPU will write back to the corresponding cache line an incremented value of count, but the same one. This doesn't contradict the MESI cache-coherence protocol, as each CPU gets the cache line exclusively and writes to it in sequence; the only unfortunate thing is that it's the same counter value being written.

What I'm interested in, however, is the impact of placing memory barriers. Setting aside the problem in the preceding paragraph, will the fact that memory barriers aren't put in place (or are placed poorly) bring its own "negative" contribution to how this program works?

Thinking intuitively about store buffers, and the fact that the values in them probably can't be "skipped" or "lost", they'll eventually have to be written to the cache line. So write barriers won't have an impact. Would the invalidate queues and the read barriers likewise have no impact? Is my assumption correct? Am I missing something?
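
For example, placing full barriers around the increment, as in the sketch below (the fences and the function name are just my illustration), still leaves the read-modify-write itself non-atomic, and still a data race; this is the kind of placement I'm asking about:

    #include <atomic>

    static int counter = 0;            // still a plain int, no lock

    void worker_with_fences(int n) {
        for (int i = 0; i < n; ++i) {
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
            counter++;   // still a non-atomic LOAD / INC / STORE
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        }
    }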

  • I would expect this to be system specific; what system is this? I would expect the memory barrier to only affect the nearby core and its L1 cache at best, but perhaps in a given system design it extends deeper; if it doesn't, then you will always have this problem without a lock. – old_timer Aug 04 '20 at 15:26
  • What does `INC count` do? More specifically, why is there what appears to be a memory operand? – Erik Eidt Aug 04 '20 at 15:35
  • @old_timer I don't have a specific system in mind, which is why I resorted to the "memory-ordering-hostile" one :) Not sure if what I'm asking is clear, as I might have worded my initial question poorly: would the program's result be closer to the real value (for N=100,000 iterations, a count value of 200,000) should memory barriers be strategically placed? Or would it make no difference whatsoever? I agree 100% that the lock would solve the main problem I referenced in my initial question, but I'm just trying to grasp the impact of the memory barriers. – Mihai Albert Aug 04 '20 at 15:37
  • @ErikEidt I meant `INC count` to increase the value that was just loaded before. To make things meaningful, we probably have to consider that it's a register where `count` was placed after the load. Otherwise, if the memory was addressed directly, the cache line would be placed in 'exclusive', then 'modified'. As only one CPU is allowed to do this in turn, there won't be any point in using memory barriers. – Mihai Albert Aug 04 '20 at 15:42
  • 3
    The program can't be fixed by the addition of memory barriers, so the program isn't broken by the lack of them. – Matt Timmermans Aug 04 '20 at 15:45
  • @MattTimmermans I agree. But would adding memory barriers make things worse or better as opposed to the initial code? – Mihai Albert Aug 04 '20 at 15:48
  • @MihaiAlbert memory barriers are specific to the chip/architecture design as to what they do, so there isn't a generic answer. A specific multi-core chip may have one result where another may have a different one; looking at the top of the answer I see below, certainly memory barriers don't create atomicity, nor do they create any kind of locking between cores. – old_timer Aug 04 '20 at 15:51
  • What the memory barrier does for a particular core/chip/system should be documented for that core/chip/system; in my experience it is more of a flushing thing. With modern cores you can have many transactions in flight at the same time, and you may wish to do something (flush the cache, for example) for which you want the transactions in flight to be completed before you do the next thing. – old_timer Aug 04 '20 at 15:53
  • This does not mean that one core gets a jump start on the shared location; other cores, if involved in the barrier, could have just been stalled at the same point in the code. So without more detail I would think that barriers simply make the code run slower. Slower technically reduces failures per unit time, but that would be the wrong way to look at it. – old_timer Aug 04 '20 at 15:56
  • 2
    `LOAD count; INC; STORE count` makes sense, or else just `INC count` w/o load and store, but `LOAD count; INC count; STORE count` is problematic. – Erik Eidt Aug 04 '20 at 16:46
  • 1
    Btw, there's no "better" or "worse". There's just broken/buggy or not.. – Erik Eidt Aug 04 '20 at 16:52

1 Answer


You're correct: memory barriers can't create atomicity. They only order this core's own accesses to its L1d cache (e.g. drain the store buffer before later stores or loads = full barrier, or wait for earlier loads to read cache before any later loads and stores can execute = light-weight barrier). They don't bundle together multiple instructions into an atomic RMW transaction.

To create atomicity wrt. anything another core can do, you need this core to keep the cache line in MESI Exclusive or Modified state from the load to the store (Can `num++` be atomic for `int num`?). Barriers don't do that; you need special asm instructions like x86 `lock add dword [mem], 1` or, on many RISC-like machines, an LL/SC retry loop that aborts the store if the cache line hasn't stayed exclusive to this core since the load-linked.
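
At the C++ level, one way such a retry loop can be expressed (a sketch, not the only option; the function and parameter names are illustrative) is with `compare_exchange_weak`, which compilers typically lower to a load-linked/store-conditional loop on LL/SC machines, or to a `lock cmpxchg` loop on x86:

    #include <atomic>

    void increment(std::atomic<int>& count) {
        int expected = count.load(std::memory_order_relaxed);
        // Retry until no other core modified the value between our load and store;
        // on failure, `expected` is reloaded with the value currently in memory.
        while (!count.compare_exchange_weak(expected, expected + 1,
                                            std::memory_order_relaxed)) {
        }
    }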


Memory barriers are important for C++ `std::atomic` because that also implies ordering (`acquire`, `release`, or `seq_cst`), unless you use `memory_order_relaxed`, in which case compilers will never use barrier instructions.
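
For the counter in the question, that would look something like the sketch below (names are illustrative); `fetch_add` is an atomic RMW, so no increments are lost, and with relaxed ordering no separate barrier instructions are emitted (on x86 it becomes a single `lock`-prefixed add):

    #include <atomic>

    std::atomic<int> counter{0};

    void worker(int n) {
        for (int i = 0; i < n; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // atomic RMW, no ordering beyond atomicity
    }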
