Pushing some fences in is not enough to guarantee atomicity.
For single-threaded code there's no real benefit to them; the CPU would know to order the loads and stores internally to achieve correct execution, as if the core ran serially (even though in reality, most modern CPUs would run it out of order).
The benefit of fences might come in scenarios such as this -
thread1: | thread 2:
store [x],1 | store [y],1
load [y] -> r1 | load [x] -> r2
This is a classic example of a memory consistency issue - the outcomes the programmer would expect when reading the two registers are 1,1 (both stores occurred first, then both loads), or 1,0 or 0,1 (if one of the threads ran ahead of the other). What you don't expect is 0,0, since at least one of the threads should have completed its store before any load could miss both. However, with relaxed memory ordering this is possible - the loads are done early in the pipe, and the stores retire very late. Since there's no intra-thread aliasing between the addresses (assume x != y), there's nothing the CPU does to prevent that.
Adding fences as below would guarantee that if one of the threads reached its load, the preceding store must have been dispatched and observed. This means you can still get 0,1 and 1,0 (if one thread's store-fence-load sequence completes first), and of course 1,1, but you can no longer get 0,0.
thread1: | thread 2:
store [x],1 | store [y],1
mfence | mfence
load [y] -> r1 | load [x] -> r2
See also - http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
However, you asked about atomicity - that's a stronger requirement. Let's take your example -
BTS WORD PTR [addr], 0
MFENCE
If we replicate it into 2 threads, it's essentially like before, except that the fence goes after the load and the store (the fact that they're grouped into the same instruction doesn't change the basic operations performed). What's to stop both threads from doing their reads first, each reading 0, and then doing their stores? These would involve some MESI-state race in the caches, as both threads would compete for ownership if they're on different cores, but ultimately both stores would write to that line. Then you can go perform the mfences all you want; that's not going to save you from the already broken atomicity.
What would guarantee atomicity is a good old-fashioned lock. The threads won't be able to simultaneously share the line that way, even for the reads. Locking is usually considered a slow but necessary evil, but some modern CPUs may even optimize it away in HW! See - http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
EDIT:
After searching a bit, I believe what led to this question is related to how the atomic keyword is defined in C++11. These links - Concurrency: Atomic and volatile in C++11 memory model and http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/ - indicate that some of the implementations are done by pushing mfences after the store. However, I don't think this implies that any regular (non-library) operation done on an atomic variable is bound to be atomic. Anyway, this mechanism is supposed to provide multiple memory consistency models, so we'll need to be more specific here.
EDIT2:
There does seem to be a large "movement" (not sure what to call it :) trying to reduce the necessity of locks; here's an interesting piece: http://preshing.com/20120612/an-introduction-to-lock-free-programming/ . This is mostly about SW design and being able to distinguish the real potential data races, but the bottom line seems to be that there will always be some locks required. The C++11 additions, while making life easier for a given consistency model and removing the need for the programmer to implement a HW-specific solution, might still be forced to fall back on the old solution. Quote: "Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform."