Pushing some fences in is not enough to guarantee atomicity.
For single-threaded code there's no real benefit to them; the CPU would know to order the loads and stores internally to achieve correct execution, as if the core ran serially (even though in reality, most modern CPUs would run it out of order).
The benefit of fences might come in scenarios such as this -
thread1: | thread 2:
store [x],1 | store [y],1
load [y] -> r1 | load [x] -> r2
This is a classic example of a memory consistency issue - the outcomes the programmer would expect when reading the two registers are 1,1 (both stores occurred first, then both loads), or 1,0 or 0,1 (if one of the threads ran ahead of the other). What you don't expect is 0,0, since at least one of the threads should have completed its store before any load could miss both. However, with relaxed memory ordering this is possible - the loads are done early in the pipe, and the stores retire very late. Since there's no intra-thread aliasing between the addresses (assume x != y), there's nothing the CPU does to prevent that.
Adding fences as below would guarantee that if one of the threads reached its load, the preceding store must have been dispatched and observed. This means you can still get 0,1 and 1,0 (if one thread's store-fence-load sequence completes first), and of course 1,1, but you can no longer get 0,0.
thread1: | thread 2:
store [x],1 | store [y],1
mfence | mfence
load [y] -> r1 | load [x] -> r2
See also - http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
However, you asked about atomicity - that's a stronger requirement. Let's take your example -
BTS WORD PTR [addr], 0
MFENCE
If we replicate it into 2 threads, it's essentially like before, except that the fence goes after the load and the store (the fact that they're grouped into the same instruction doesn't change the basic operations performed). What's to stop both threads from doing their reads first, each reading 0, and then doing their stores? These would involve some MESI-state race in the caches, as both threads would compete for ownership if they're on different cores, but ultimately both stores would write to that line. Then you can go perform the mfences all you want; that's not going to save you from the already broken atomicity.
What would guarantee atomicity is a good old-fashioned lock. The threads won't be able to simultaneously share the line that way, even for the reads. Locking is usually considered a slow but necessary evil, but some modern CPUs may even optimize it away in HW! See - http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
EDIT:
After searching a bit, I believe what led to this question is related to how the atomic keyword is defined in C++11. These links - Concurrency: Atomic and volatile in C++11 memory model and http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/ - indicate that some of the implementations are done by pushing mfences after the store. However, I don't think this implies that any regular (non-library) operation done on an atomic variable is bound to be atomic. Anyway, this mechanism is supposed to provide multiple memory consistency models, so we'll need to be more specific here.
EDIT2:
There does seem to be a large "movement" (not sure what to call it :) trying to reduce the necessity of locks; here's an interesting piece: http://preshing.com/20120612/an-introduction-to-lock-free-programming/ . This is mostly about SW design and being able to distinguish the real potential data races, but the bottom line seems to be that there will always be some locks required. The C++11 additions, while making life easier for a given consistency model and removing the need for the programmer to implement a HW-specific solution, might still be forced to fall back on the old solution. Quote: "Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform."