
Say, for example, I have an exclusive atomic-ops-based spin lock implementation as below:

bool TryLock(volatile TInt32 * pFlag)
{
   return !(AtomicOps::Exchange32(pFlag, 1) == 1);
}

void Lock (volatile TInt32 * pFlag) 
{  
    while (AtomicOps::Exchange32(pFlag, 1) ==  1) {
        AtomicOps::ThreadYield();
    }
}

void Unlock(volatile TInt32 * pFlag)
{
    *pFlag = 0; // is this OK? or is atomicity needed here as well for the load and store?
}

Where AtomicOps::Exchange32 is implemented on Windows using InterlockedExchange and on Linux using __atomic_exchange_n.
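The question does not show AtomicOps::Exchange32 itself; for context, a minimal sketch of what such a wrapper might look like, based only on the two functions named above (the TInt32 typedef and the memory-order argument to __atomic_exchange_n are assumptions, and the memory order chosen is exactly what the discussion below is about):

#if defined(_WIN32)
#include <windows.h>
#endif

typedef int TInt32; // assumption: TInt32 is a 32-bit signed integer

namespace AtomicOps {

TInt32 Exchange32(volatile TInt32 * pTarget, TInt32 value)
{
#if defined(_WIN32)
    // InterlockedExchange acts on LONG and is a full memory barrier on Windows.
    return InterlockedExchange(reinterpret_cast<volatile LONG*>(pTarget), value);
#else
    // The memory-order argument is an assumption; the question does not say which one is used.
    return __atomic_exchange_n(pTarget, value, __ATOMIC_SEQ_CST);
#endif
}

} // namespace AtomicOps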

Abhishek Jain
  • Related questions: http://stackoverflow.com/questions/1383363/is-my-spin-lock-implementation-correct-and-optimal http://stackoverflow.com/questions/6810733/do-spin-locks-always-require-a-memory-barrier-is-spinning-on-a-memory-barrier-e http://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence – gavv Sep 18 '15 at 18:34
  • Can you please elaborate? Why exactly would I need a memory barrier here? What goes wrong, and how, if I do not use one? – Abhishek Jain Sep 18 '15 at 18:40
  • ```Lock()``` requires an "acquire barrier" to ensure that all changes made while the spinlock is locked will be applied only after ```pFlag``` is updated. ```Unlock()``` requires a "release barrier" to ensure that all changes made while the spinlock was locked will be applied before ```pFlag``` is updated. See details here: https://jfdube.wordpress.com/2012/03/08/understanding-memory-ordering/ . This is a general approach, but on x86 you need only compiler barriers instead of both acquire and release barriers; see here: http://preshing.com/20120913/acquire-and-release-semantics/ – gavv Sep 18 '15 at 19:30
  • I'll add more details in a full answer. – gavv Sep 18 '15 at 19:43
  • See also my remarks about locks in my answer to http://stackoverflow.com/questions/20446982/determining-the-location-for-the-usage-of-barriers-fences . – Arch D. Robison Sep 18 '15 at 19:55
  • I don't think there's any way to sensibly comment on this code without understanding the semantics of the various atomic operations you're using. For example, what memory visibility properties do they have? Does `ThreadYield` yield at scheduler level or CPU level? And so on. – David Schwartz Sep 18 '15 at 20:44
  • @g-v How do you know his `Exchange32` function isn't a full barrier already? – David Schwartz Sep 18 '15 at 20:45
  • David, I don't; therefore I didn't use it in my code example. If it contains an acquire barrier, it can be used as is. If it contains a full barrier, it's better to replace it with an acquire barrier, as an optimization. – gavv Sep 19 '15 at 12:27
  • Thanks for your note, I've made it more explicit in answer. – gavv Sep 19 '15 at 12:44

2 Answers


You need two memory barriers in a spinlock implementation:

  • an "acquire barrier" or "import barrier" in TryLock() and Lock(). It forces operations issued while the spinlock is held to become visible only after the pFlag value is updated.
  • a "release barrier" or "export barrier" in Unlock(). It forces operations issued before the spinlock is released to become visible before the pFlag value is updated.

You also need two compiler barriers for the same reasons.

See this article for details.
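For illustration (this snippet is not from the linked article), a compiler barrier on its own can be expressed like this on common toolchains; it only constrains the compiler and emits no fence instruction:

#if defined(_MSC_VER)
#include <intrin.h>
// MSVC intrinsic; deprecated in newer toolchains in favour of C++11 atomics.
#define COMPILER_BARRIER() _ReadWriteBarrier()
#else
// GCC/Clang idiom: an empty asm statement with a "memory" clobber.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif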


This approach is for the generic case. On x86/64:

  • there are no separate acquire/release barriers, only a single full barrier (memory fence);
  • there is no need for memory barriers here at all, since this architecture is strongly ordered;
  • you still need compiler barriers (a sketch of an x86-only unlock follows below).

More details are provided here.
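To illustrate the x86-specific point, here is a hedged sketch (an assumption-laden example, not part of the linked material): on x86/x86-64 the unlock can be a plain store, as long as the compiler is prevented from reordering around it.

typedef int TInt32; // as in the question (assumed to be a 32-bit integer)

void UnlockX86(volatile TInt32 * pFlag)
{
    // x86/x86-64 only: the CPU does not reorder stores with earlier stores,
    // so only the compiler must be stopped from moving critical-section
    // accesses below this point. No fence instruction is emitted.
    __asm__ __volatile__("" ::: "memory");
    *pFlag = 0;
}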


Below is an example implementation using GCC atomic builtins. It will work for all architectures supported by GCC:

  • it will insert acquire/release memory barriers on architectures where they are required (or a full barrier if acquire/release barriers are not supported but the architecture is weakly ordered);
  • it will insert compiler barriers on all architectures.

Code:

bool TryLock(volatile bool* pFlag)
{
   // acquire memory barrier and compiler barrier
   return !__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE);
}

void Lock(volatile bool* pFlag) 
{  
    for (;;) {
        // acquire memory barrier and compiler barrier
        if (!__atomic_test_and_set(pFlag, __ATOMIC_ACQUIRE)) {
            return;
        }

        // relaxed waiting, usually no memory barriers (optional)
        while (__atomic_load_n(pFlag, __ATOMIC_RELAXED)) {
            CPU_RELAX(); // CPU pause/yield hint (e.g. the x86 pause instruction); assumed to be defined elsewhere
        }
    }
}

void Unlock(volatile bool* pFlag)
{
    // release memory barrier and compiler barrier
    __atomic_clear(pFlag, __ATOMIC_RELEASE);
}

For "relaxed waiting" loop, see this and this questions.

See also Linux kernel memory barriers as a good reference.
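For completeness, a small usage sketch (not part of the answer above): two threads increment a shared counter protected by the Lock()/Unlock() pair shown earlier. It assumes those definitions are in scope and that CPU_RELAX() is defined (e.g. as a pause/yield hint); the pthread setup is only illustrative.

#include <pthread.h>
#include <stdio.h>

static volatile bool g_flag = false;   // the spinlock word, initially unlocked
static long g_counter = 0;             // data protected by the spinlock

static void* Worker(void*)
{
    for (int i = 0; i < 100000; ++i) {
        Lock(&g_flag);        // acquire: later accesses cannot move above this
        ++g_counter;          // critical section
        Unlock(&g_flag);      // release: earlier accesses cannot move below this
    }
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, Worker, NULL);
    pthread_create(&t2, NULL, Worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", g_counter); // expected: 200000
    return 0;
}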


In your implementation:

  • Lock() calls AtomicOps::Exchange32(), which already includes a compiler barrier and perhaps an acquire or full memory barrier (we don't know, because you didn't provide the actual arguments to __atomic_exchange_n()).
  • Unlock() is missing both the memory barrier and the compiler barrier, so it is broken.

Also consider using pthread_spin_lock() if it is an option.
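Since pthread_spin_lock() is suggested, here is a minimal usage sketch of the POSIX spinlock API (error checking omitted for brevity):

#include <pthread.h>

pthread_spinlock_t g_spin;

void Example()
{
    pthread_spin_init(&g_spin, PTHREAD_PROCESS_PRIVATE); // lock is not shared across processes
    pthread_spin_lock(&g_spin);
    // ... critical section ...
    pthread_spin_unlock(&g_spin);
    pthread_spin_destroy(&g_spin);
}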

gavv
  • By "optimization barriers" do you mean compiler barriers? The answer already mentions that compiler barriers are always required, but I'll add a clarification. – gavv Sep 19 '15 at 12:22
  • How is it guaranteed that the “efficient waiting” loop will eventually exit without an acquire barrier? – 5gon12eder Sep 19 '15 at 12:51
  • @5gon12eder, isn't ```volatile``` enough here? Are there architectures where a write made by one CPU is not visible to another CPU without a memory barrier? (so they do not provide cache coherency?) – gavv Sep 19 '15 at 13:15
  • I don't know whether such architecture exists but as far as the language is concerned, I don't think it's safe. Assuming that [this answer](http://stackoverflow.com/a/25212461/1392132) is correct, it also seems that Intel doesn't recommend it. – 5gon12eder Sep 19 '15 at 13:23
  • I often saw such an implementation, including the Linux kernel for x86 (see ```arch_spin_lock()```, which uses ```cpu_relax()```). Although I have no proof that it's safe on all architectures supported by GCC, it likely is; see here: http://stackoverflow.com/questions/11282899/memory-barrier-and-cache-flush – gavv Sep 19 '15 at 13:36
  • I have posted a question: http://stackoverflow.com/questions/32677667/is-memory-barrier-or-atomic-operation-required-in-a-busy-wait-loop – gavv Sep 20 '15 at 09:11
  • @5gon12eder, I've updated the answer with a safe implementation. If it were C++11 atomics, the code would be correct according to C++ memory model. Thanks for your comment about ```volatile```! – gavv Sep 21 '15 at 07:55
  • Yes, that looks safe now. It didn't occur to me that one could use a loop of atomic loads with relaxed memory order as a possibly lower-overhead alternative to repetitively trying an atomic exchange with acquire semantics. I already did an atomic increment of your question's vote score so I can't do it again now. – 5gon12eder Sep 21 '15 at 09:58

In most cases, for releasing the resource, just resetting the lock to zero (as you do) is almost OK (e.g. on an Intel Core processor), but you also need to make sure that the compiler will not reorder instructions (see below, and see also g-v's post). If you want to be rigorous (and portable), there are two things that need to be considered:

What the compiler does: it may reorder instructions to optimize the code, and thus introduce subtle bugs if it is not "aware" of the multithreaded nature of the code. To avoid that, it is possible to insert a compiler barrier.

What the processor does: some processors (like the Intel Itanium, used in professional servers, or the ARM processors used in smartphones) have a so-called "relaxed memory model". In practice, it means that the processor may decide to change the order of the operations. Again, this can be avoided by using special instructions (load barrier and store barrier). For instance, on an ARM processor, the instruction DMB ensures that all store operations are completed before the next instruction (and it needs to be inserted in the function that releases a lock).
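To make that concrete, a hedged sketch of a release on the lock word, using the GCC builtins already mentioned in the question (the function name and TInt32 typedef are illustrative): the release ordering supplies exactly the store barrier described above. On ARMv7 GCC typically emits a dmb before the store, on ARMv8 a store-release instruction, and on x86 a plain store with only the compiler-barrier effect.

typedef int TInt32; // as in the question (assumed to be a 32-bit integer)

void UnlockRelease(volatile TInt32 * pFlag)
{
    // Release store: all earlier writes in the critical section become
    // visible before the flag is cleared.
    __atomic_store_n(pFlag, 0, __ATOMIC_RELEASE);
}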

Conclusion: It is very tricky to make this kind of code correct. If you have compiler / OS support for these facilities (e.g., stdatomic.h in C11, or std::atomic in C++11), it is much better to rely on them than to write your own (but sometimes you have no choice). In the specific case of a standard Intel Core processor, I think that what you do is correct, provided you insert a compiler barrier in the release operation (see g-v's post).
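As an example of the standard-library route mentioned above, here is the common C++11 spinlock idiom built on std::atomic_flag (a generic sketch, not the poster's code); acquire ordering on test_and_set and release ordering on clear give the two barriers discussed in this thread.

#include <atomic>

std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

void LockFlag()
{
    // Acquire: loops until the previous value of the flag was "clear".
    while (g_lock.test_and_set(std::memory_order_acquire)) {
        // busy-wait; a CPU pause/yield hint could go here
    }
}

void UnlockFlag()
{
    // Release: publishes the critical section's writes before clearing the flag.
    g_lock.clear(std::memory_order_release);
}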

On compile-time versus run-time memory ordering, see: https://en.wikipedia.org/wiki/Memory_ordering

My code for some atomics / spinlocks implemented on different architectures: http://alice.loria.fr/software/geogram/doc/html/atomics_8h.html (but I'm unsure it's 100% correct)

BrunoLevy
  • "In the specific case of standard Intel Core processor, I think that what you do is correct." -- As mentioned in comments and my answer, it is *not* without a compiler barrier (```volatile``` is not enough here). With a compiler barrier, it will be correct for x86. – gavv Sep 19 '15 at 14:06