
I implemented a SpinLock class, as follows:

#include <atomic>

struct Node {
    int number;
    std::atomic_bool latch;

    void add() {
        lock();
        number++;
        unlock();
    }
    void lock() {
        bool unlatched = false;
        while(!latch.compare_exchange_weak(unlatched, true, std::memory_order_acquire));
    }
    void unlock() {
        latch.store(false, std::memory_order_release);
    }
};

I created two threads that each call the add() method of the same Node instance 10 million times.

Unfortunately, the result is not 20 million. What am I missing here?
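
For reference, the test driver looks roughly like this (a minimal sketch; the Node struct is the one defined above):

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    Node node{};  // value-initialize: number == 0, latch == false
    auto worker = [&node] {
        for (int i = 0; i < 10000000; ++i)
            node.add();
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << node.number << '\n';  // expected 20000000, but prints less
}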

syko
  • Be warned, this is about the worst spinlock you can possibly implement. Among other problems: 1) When you finally do acquire the `lock`, you take the mother of all mispredicted branches leaving the `while` loop, which is the worst possible time for such a thing. 2) The `lock` function can starve another thread running in the same virtual core on a hyper-threaded CPU. – David Schwartz Apr 20 '15 at 10:50
  • @DavidSchwartz, Thank you for your comments. May I ask more about the issues you mentioned? 2) That the lock function can starve another thread: that's right, and it is meant to do that, since I am sure the lock's lifetime is very short. I can use some spin counters to ease this problem, am I right? 1) Why do I take 'the mother of all mispredicted branches' in this code, and how can I improve it? Do you have any comments on it? Thank you again – syko Apr 26 '15 at 05:01
  • 2) No. None of those things fix the problem. Using the `pause` instruction does. 1) Because the CPU predicts the `while` loop to keep looping. And, as it happens, you can also fix this with the `pause` instruction. – David Schwartz Apr 26 '15 at 20:21
  • @DavidSchwartz, I got what you mean. Maybe I need some googling to figure out how 'pause' works and how it prevents branch misprediction. – syko Apr 27 '15 at 01:52
  • Pause prevents speculative execution, eliminating the branch misprediction penalty. (And, by the way, if you don't know this kind of stuff inside out, you have no business writing spinlocks. You will make every mistake and not even realize you had other choices.) – David Schwartz Apr 27 '15 at 08:50
  • @DavidSchwartz, I'm interested in writing these kinds of things but I don't know this kind of stuff inside out. Can you recommend how a person can make themselves know this inside out? – codeshot Jun 01 '16 at 00:22
  • @codeshot I don't really have a good answer to this, other than carefully studying the high-quality work of others and gaining lots of experience. This really isn't a pool amateurs can play in, beyond writing toy code. I myself only know enough to be dangerous. ;) – David Schwartz Jun 01 '16 at 00:54
  • @DavidSchwartz, It sounds like the only way to become knowledgeable enough to be allowed to try writing this kind of code is to try writing this kind of code. – codeshot Jun 04 '16 at 11:41
  • @codeshot Well, there's never any harm in trying to do any kind of coding. Just have realistic expectations for the quality of what you produce and don't delude yourself into thinking you're doing it right unless you are. To some extent it's like rolling your own encryption. It's very easy to do it badly and sometimes very hard to point out what's bad about a particular effort. And way too many people think it's easy or what you should do. – David Schwartz Jun 04 '16 at 15:21
  • @DavidSchwartz: Which spin lock implementation would you recommend? Neither std nor boost offer one, so it isn't obvious. Folly has some things, google has some things, all of them offer a fair amount more than strictly necessary. I'd be interested in a plain spin lock which might include the pause, as well as something just a touch more elaborate that also supplies a bool is_locked function, based on std::atomic for example. – Cookie Mar 21 '17 at 15:22
  • @Cookie You can write one for a specific CPU and platform. You cannot write a portable one. For example, on x86, you need to issue a `pause` instruction (`rep nop`) in the inner loop. There is no portable way to do this -- you need CPU expertise. – David Schwartz Mar 21 '17 at 18:03
  • There is another related problem: spinning on compare-exchange may cause a lot of cache-coherency traffic. A better approach is to spin only on atomic loads, so that no exclusive ownership of the cache line is needed until the lock is released. Look here for more information: https://rigtorp.se/spinlock/. It also mentions the already-discussed problem with "pausing". – Daniel Langr Dec 28 '20 at 13:25
  • This question is also relevant: https://stackoverflow.com/q/63008857/580083. Quoting from one of its comments: _"If you want a core to be able to read but not write a line while checking lock availability, **spin read-only with a load separate from the CAS attempt**, xchg, or lock bts. This is clearly better because it leaves the line in S state, not E, and is (or should be) a well known fact among lock and other spin-loop implementers (on par with using `pause` in the spin-retry part)."_ (credit @PeterCordes). A sketch combining these points appears after this comment thread. – Daniel Langr Dec 28 '20 at 13:32
  • For a basic spinlock implementation in C, I added an answer here: [Basic spin lock implementation in C11](https://stackoverflow.com/a/73819087/4561887) – Gabriel Staples Sep 22 '22 at 18:18
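
Putting the advice from these comments together (spin read-only on a plain load, and pause in the retry path), a test-and-test-and-set lock might look like the sketch below. This assumes x86 and the `_mm_pause()` intrinsic from `<immintrin.h>`; it is illustrative, not code from the question or the answers:

#include <atomic>
#include <immintrin.h>  // _mm_pause(), x86 only

struct TtasSpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // Try the atomic read-modify-write; exchange returns the old value.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;  // old value was false: we now own the lock
            // Contended: spin read-only so the cache line stays in shared state,
            // and hint the CPU that this is a spin-wait loop.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};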

2 Answers


The problem is that compare_exchange_weak updates the unlatched variable when it fails. From the documentation of compare_exchange_weak:

Compares the contents of the atomic object's contained value with expected:
  • if true, it replaces the contained value with val (like store).
  • if false, it replaces expected with the contained value.

I.e., after the first failing compare_exchange_weak, unlatched will be updated to true, so the next loop iteration will try to compare_exchange_weak true with true. This succeeds, and you have just taken a lock that is still held by another thread.

Solution: Make sure to set unlatched back to false before each compare_exchange_weak, e.g.:

while(!latch.compare_exchange_weak(unlatched, true, std::memory_order_acquire)) {
    unlatched = false;  // the failed CAS wrote the current value (true) into unlatched
}
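
As an aside: since the expected value here is always false, a plain exchange sidesteps the expected-value bookkeeping entirely. A sketch of an equivalent lock() (illustrative, not the only way to write it):

void lock() {
    // exchange returns the previous value: keep spinning while it was already true.
    while (latch.exchange(true, std::memory_order_acquire))
        ;
}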
gexicide
  • +1, and note to the OP, the default constructor of `Node` will *not* value-initialize `number` as-presented. I.e. `Node node;` will not leave a deterministic value in `node.number` to start your sequence. You need a constructor of such usage, `Node() : number() {}` is sufficient. [**see it live**](http://coliru.stacked-crooked.com/a/92d79ddfb574ba70) – WhozCraig Oct 27 '14 at 08:34
  • Thanks for your comments!! And I found my foolish mistakes! Thank you – syko Oct 27 '14 at 08:42
  • Please note that the loop also requires `_mm_pause()` to let the sibling hyper-thread make progress. – Anton Oct 27 '14 at 08:57

As mentioned by @gexicide, the problem is that the compare_exchange functions update the expected variable with the current value of the atomic variable. That is also the reason why you have to use the local variable unlatched in the first place. To solve this, you can set unlatched back to false in each loop iteration.

However, instead of using compare_exchange for a job its interface is rather ill-suited to, it is much simpler to use std::atomic_flag:

class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value; spin while it was already set.
        while (locked.test_and_set(std::memory_order_acquire)) { ; }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};

Source: cppreference

Manually specifying the memory order is just a minor potential performance tweak, which I copied from the source. If simplicity is more important than the last bit of performance, you can stick to the default values and just call locked.test_and_set() / locked.clear().
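
Applied to the question's counter, usage might look like this (a sketch; since SpinLock provides lock() and unlock(), it satisfies BasicLockable, so std::lock_guard can manage it):

#include <mutex>  // std::lock_guard works with any BasicLockable type

// Illustrative: the question's Node, now guarded by the SpinLock above.
struct Node {
    int number = 0;  // explicitly initialized
    SpinLock latch;

    void add() {
        std::lock_guard<SpinLock> guard(latch);  // RAII: unlocks even on early exit
        ++number;
    }
};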

Btw.: std::atomic_flag is the only type that is guaranteed to be lock-free, although I don't know of any platform where operations on std::atomic_bool are not lock-free.

Update: As explained in the comments by @David Schwartz, @Anton and @Technik Empire, the empty loop has some undesirable effects like branch misprediction, thread starvation on HT processors and overly high power consumption - so in short, it is a pretty inefficient way to wait. The impact and solution are architecture, platform and application specific. I'm no expert, but the usual solution seems to be to add either a cpu_relax() on Linux or YieldProcessor() on Windows to the loop body.
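
To make that concrete, here is a sketch of how such a relax hint is often wrapped (the spin_relax name and the platform checks are illustrative, not a standard API):

// Illustrative wrapper; adapt the conditions to your toolchain.
#if defined(_MSC_VER)
    #include <windows.h>
    inline void spin_relax() { YieldProcessor(); }  // Windows macro for the CPU's spin-wait hint
#elif defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
    inline void spin_relax() { _mm_pause(); }       // emits the x86 `pause` instruction
#else
    inline void spin_relax() {}                     // no-op fallback on other targets
#endif

The lock() loop above would then call it in its body:

void lock() {
    while (locked.test_and_set(std::memory_order_acquire)) {
        spin_relax();  // be polite to the sibling hyper-thread while waiting
    }
}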

EDIT2: Just to be clear: the portable version presented here (without the special cpu_relax etc. instructions) should already be good enough for many applications. If your SpinLock spins a lot because someone else is holding the lock for a long time (which might already indicate a general design problem), it is probably better to use a normal mutex anyway.

MikeMB
  • Just remember to initialize the flag with ATOMIC_FLAG_INIT. – Yochai Timmer Apr 20 '15 at 10:48
  • Thanks for a light, clean and easy implementation. – tomi.lee.jones Jun 15 '15 at 09:35
  • As per David Schwartz's comments on the question, should probably do a `cpu_relax();` inside the while loop and for windows just do `#define cpu_relax() asm volatile("pause" ::: "memory")`. +1 though. –  Jan 24 '16 at 04:49
  • just add a std::this_thread::yield call into the while loop: http://en.cppreference.com/w/cpp/thread/yield (a sketch of this variant appears after these comments) – Martin Gerhardy Jun 06 '16 at 07:58
  • @Martin: I thought about that, but `std::this_thread::yield()` is a pretty heavy system call, so I'm not sure if I'd put it in the loop body of a spinlock. My (untested) assumption is that in most situations where that would be OK, you'd want to use a std::mutex (or similar) to begin with. – MikeMB Jun 06 '16 at 08:23
  • @Mike: thanks for the info - will do some performance measurements on my own implementation. – Martin Gerhardy Jun 07 '16 at 06:53
  • @Martin: It would be great if you could share your findings when you do. – MikeMB Jun 07 '16 at 08:36
  • @MartinGerhardy There is no point in using yield in a spin-lock. The goal of spin-locks is avoiding context switches, which are expensive in small critical sections. – Jorge Bellon Jan 05 '17 at 10:13
  • @JorgeBellón On a single-core system, if the lock is owned by another thread, you'll need context switches anyway, so yield would be more efficient. – rox Dec 06 '17 at 16:18
  • @JorgeBellón FWIW, a yield-based spin lock is more lightweight than a mutex, because a yield call doesn't change the execution state of the thread on a Linux system, while a mutex does. – rox Dec 06 '17 at 16:23
  • @rox that's not my point. The technical definition of a spin mutex is a lock that does not release the CPU if it is already acquired by others. In single-core systems you don't usually use spinlocks but a regular mutex. – Jorge Bellon Dec 06 '17 at 17:32
  • @JorgeBellón As far as I know, the glibc implementation uses something similar to yield: https://github.com/lattera/glibc/blob/master/mach/spin-solid.c . I'm not sure where you found the definition. – rox Dec 07 '17 at 05:01
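
For reference, the yield-based variant discussed in these comments might look like this (a sketch; whether it beats a plain std::mutex is workload-dependent and worth measuring):

#include <atomic>
#include <thread>

class YieldingSpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (locked.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();  // give up the time slice instead of burning it
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};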