
I have recently implemented a fair reader-writer ticket spinlock in C++. The code is fairly simple and I thought it was working great. I have integrated the spinlock into a larger application and I noticed that on some rare occasions the code runs extremely slowly, while most of the time it works really fast. I know it is due to the spinlock because if I replace it with a simple reader-writer spinlock (not fair and no ticket), the code suddenly runs much faster. This has happened a few times on different machines. I know that these kinds of locks can run slowly if you use more threads than cores, but I ran it with 16 threads on a machine with 48 cores. I couldn't reproduce the issue on my laptop with 4 threads and 4 cores. Here is the code:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <thread>

    // Round v up to the next power of two (assumes a 64-bit size_t).
    inline size_t rndup(size_t v) {

        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v |= v >> 32;
        v++;

        return v;
    }

    class SpinLockRW_MCS {

        public:

            SpinLockRW_MCS(const size_t nb_readers) :   writer(nullptr), lock_pool(nullptr), it_lock_pool(0),
                                                        load_lock_pool(0), mask_it(rndup(2 * nb_readers + 1) - 1),
                                                        padding1{0}, padding2{0}, padding3{0}, padding4{0} {

                if (nb_readers <= std::thread::hardware_concurrency()){

                    lock_pool = new Lock[mask_it + 1];
                    lock_pool[0].is_locked = false;
                }
            }

            ~SpinLockRW_MCS() {

                clear();
            }

            inline void clear() {

                if (lock_pool != nullptr){

                    delete[] lock_pool;
                    lock_pool = nullptr;
                }

                writer = nullptr;

                it_lock_pool = 0;
                load_lock_pool = 0;
            }

            inline void acquire_reader() {

                uint_fast32_t retry = 0;

                // Take a ticket and compute the slot of the next thread in line.
                const size_t prev_reader_id = it_lock_pool.fetch_add(1) & mask_it;
                const size_t new_reader_id = (prev_reader_id + 1) & mask_it;

                // Wait until the predecessor releases our slot.
                while (lock_pool[prev_reader_id].is_locked){

                    if (++retry > 100) std::this_thread::yield();
                }

                ++load_lock_pool;

                // Re-lock our slot and unblock the next ticket holder.
                lock_pool[prev_reader_id].is_locked = true;
                lock_pool[new_reader_id].is_locked = false;
            }

            inline void release_reader() {

                --load_lock_pool;
            }

            inline void acquire_writer() {

                uint_fast32_t retry = 0;

                const size_t prev_reader_id = it_lock_pool.fetch_add(1) & mask_it;
                const size_t new_reader_id = (prev_reader_id + 1) & mask_it;

                // Wait for our turn in the ticket ring.
                while (lock_pool[prev_reader_id].is_locked){

                    if (++retry > 100) std::this_thread::yield();
                }

                // Then wait for all in-flight readers to drain.
                while (load_lock_pool){

                    if (++retry > 100) std::this_thread::yield();
                }

                lock_pool[prev_reader_id].is_locked = true;

                // The next ticket holder is unblocked in release_writer().
                writer = &lock_pool[new_reader_id];
            }

            inline void release_writer() {

                writer->is_locked = false;
            }

            inline void release_writer_acquire_reader() {

                ++load_lock_pool;

                writer->is_locked = false;
            }

        private:

            // The flag is padded so that each Lock occupies its own cache line.
            struct Lock {

                std::atomic<bool> is_locked;
                const int padding[15];

                Lock() : is_locked(true), padding{0} {}
            };

            Lock* writer;
            const int padding1[14];
            Lock* lock_pool;
            const int padding2[14];
            const size_t mask_it;
            const int padding3[14];
            std::atomic<size_t> it_lock_pool;
            const int padding4[14];
            std::atomic<size_t> load_lock_pool;
    };
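
To give an idea of the usage, here is a minimal driver in the spirit of what the application does (a hypothetical sketch, not the real code; the thread and iteration counts are arbitrary):

    #include <thread>
    #include <vector>

    int shared_value = 0;

    int main() {

        // The constructor only allocates the pool if nb_readers <= std::thread::hardware_concurrency().
        const size_t nb_readers = 4;
        SpinLockRW_MCS lck(nb_readers);

        std::vector<std::thread> threads;

        for (size_t i = 0; i < nb_readers; ++i) {

            threads.emplace_back([&]{

                for (int j = 0; j < 100000; ++j) {

                    lck.acquire_reader();
                    volatile int v = shared_value; // shared read
                    (void) v;
                    lck.release_reader();
                }
            });
        }

        threads.emplace_back([&]{

            for (int j = 0; j < 1000; ++j) {

                lck.acquire_writer();
                ++shared_value; // exclusive write
                lck.release_writer();
            }
        });

        for (auto& t : threads) t.join();

        return 0;
    }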

Any suggestion would be greatly appreciated! Thanks!

Pall Melsted
  • At the very least you should add a call to the `_mm_pause()` intrinsic (`#include <xmmintrin.h>` if available with your compiler) into your spin/wait loops to let the processor know you're in one of those. It helps performance and power efficiency. – 1201ProgramAlarm Apr 22 '19 at 14:05
  • @1201ProgramAlarm Isn't `this_thread::yield()` fulfilling the same pause intrinsic purpose as `_mm_pause()` but in a more standard C++ fashion? – Guillaume Holley Apr 22 '19 at 19:02
  • The `yield` will only be called after your spin count reaches the wait limit (100). The `_mm_pause()` will be beneficial for those 100 iterations where you're running in a very tight, fast loop (a short sketch follows after this comment thread). – 1201ProgramAlarm Apr 22 '19 at 19:56
  • @1201ProgramAlarm Thanks for the tip, I will try this. – Guillaume Holley Apr 23 '19 at 15:02
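
To make the suggestion above concrete, here is a hedged sketch of a spin loop with `_mm_pause()` in the tight part and `yield()` kept as a fallback (assumes an x86/x86-64 target; `_mm_pause()` comes from `<xmmintrin.h>`; the helper name is made up for illustration):

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <xmmintrin.h> // _mm_pause (x86/x86-64 only)

    // Spin on a flag: pause on every iteration, yield only once the retry limit is hit.
    inline void spin_until_unlocked(const std::atomic<bool>& is_locked) {

        uint_fast32_t retry = 0;

        while (is_locked.load(std::memory_order_acquire)) {

            _mm_pause();                                   // tell the CPU this is a spin-wait loop
            if (++retry > 100) std::this_thread::yield();  // fall back to the scheduler
        }
    }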

2 Answers


It is somewhat difficult to assess the problem without more details, but here's my shot in the dark: I suspect that in your scenario readers need to acquire the lock very frequently (otherwise, you would probably be better off with a traditional lock). Here's your problem:

Any single thread is able to starve out all others.

This is true for both readers and writers, while in a non-fair algorithm it is usually only true for writers. The problem in your situation happens when you have multiple readers queuing up for read access. Each thread will wait for the preceding lock to become available (`while (lock_pool[prev_reader_id].is_locked) ...`). If they can get that lock, everything is fine, but you get into trouble as soon as one thread cannot get it. All the reader threads queue up to see their predecessors flip to false. Each of them depends on their direct predecessor.

Now imagine the first reader not being able to get the lock. It will keep spinning for a while and eventually yield(). This effectively means your thread is no longer running: the operating system dropped it from the scheduling queue and it won't run again for a long time (the rest of its timeslice, which is long compared to how long it takes for the 100 spins to complete). As a consequence, the complete chain of waiting threads will very likely go into yield as well.

Eventually, the flag that the first thread was waiting for will flip to false. But your scheduler is now in a predicament. It has a bunch of threads lying around, but each of them only spins for a very short time before going into yield again. The expectation here is that all but the first thread in the chain, if picked, are almost certainly doomed to lie dormant for one more complete timeslice. As a consequence, if this happens to a thread early in the chain of waiting threads, you also condemn all the other threads in the chain to wait at least as long.

What you are playing here is a game of probabilities where your odds of winning decrease significantly with a growing number of readers in the queue. That's why the problem got worse when moving from 4 to 16 threads. In particular, once you reach the point where the average time it takes for a new reader to arrive in the queue is roughly on the order of the time it takes for a thread to move through the queue, you will have a hard time getting back to an empty queue again. This is not unlikely, as we are talking multiples of timeslices here, which brings you into the order of tens to hundreds of milliseconds.

This is a typical trade-off in fair scheduling algorithms. The fairness comes at a price, in this case that a single reader is able to block everyone. Since my reader can never overtake your reader if you managed to get to the acquire call first, I will have to wait forever if you don't move on. One solution to this problem is to give the scheduler additional information about what each thread is waiting on, so that it has a better chance of waking them up in the right order. Another is to choose a different algorithm that is better suited to your particular scenario.
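
For illustration, a minimal sketch of that last option, assuming C++17's `std::shared_mutex` (a kernel-assisted reader-writer lock) is an acceptable replacement for the spinlock:

    #include <mutex>
    #include <shared_mutex>

    std::shared_mutex rw_mutex; // blocked waiters are parked and woken by the OS

    void read_shared_data() {

        std::shared_lock<std::shared_mutex> lock(rw_mutex); // shared (reader) access
        // ... read ...
    }

    void write_shared_data() {

        std::unique_lock<std::shared_mutex> lock(rw_mutex); // exclusive (writer) access
        // ... modify ...
    }

Blocked threads are parked by the kernel instead of burning the rest of their timeslice, which is exactly the information a pure userspace spinlock cannot communicate to the scheduler.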

ComicSansMS
  • Thank you for your very thorough analysis, I do believe that you are right and it is likely to be a problem, if not THE problem, of this spinlock. Now that I think about it, every insertion of a writer would probably starve out all the readers coming afterward (since the first reader would probably have to yield). I would expect this case to happen pretty often and I am surprised the spinlock is still fast in most cases. You say that one solution is to give the scheduler additional information about what each thread is waiting on; can you describe a bit more what you mean by that? – Guillaume Holley Apr 29 '19 at 10:06
  • @GuillaumeHolley What I meant there is basically abandoning the spinlock and going for a proper operating system mutex instead. Such a mutex is part of the OS and can thus supply information back to the kernel and scheduler. See [this answer](https://stackoverflow.com/a/17330049/577603) for a discussion on why spinning might have been the wrong choice in your case anyway. As soon as you actually have contention on the lock, spinning is usually the worst option. – ComicSansMS Apr 29 '19 at 10:17

My bet is that your issue is somewhere around this line:

    if (++retry > 100) std::this_thread::yield();

I know this is how you plan to be 'optimistic', however hardcoded (arbitrary) values like this ('100' in this case) are normally an indication of a design flaw when dealing with this class of problem - as you say, you only see the issue on another system, which may be a symptom of that (since with this value it seems to work on your system). This in turn points to the `this_thread::yield()` as part of the issue.

darune
  • It is very true that a hard-coded value is far from ideal, but I do believe that calling `this_thread::yield();` on every single iteration would result in a much larger waste of CPU time. I chose 100 because it seemed to be a common value used in RW spinlock implementations. [Facebook Folly library](https://github.com/facebook/folly/blob/master/folly/synchronization/RWSpinLock.h) uses 1000 but I didn't notice any performance gain/loss between 100 and 1000. Do you have a suggestion to work around a hard-coded value? – Guillaume Holley Apr 25 '19 at 14:30
  • @GuillaumeHolley Perhaps try to just remove the yield (maybe use `_mm_pause()` as suggested by another user). Then retest on your and the other machine - now without the hardcoded value. That being said, 1000 does sound like a more reasonable hardcoded value - I wouldn't go below roughly 200-300. But again, the issue is that it doesn't really take into account the platform/system (a rough backoff sketch follows below). – darune Apr 26 '19 at 09:02
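
To make the forward reference above concrete, one common way to avoid a single hard-coded constant is an exponential backoff: the number of `_mm_pause()` calls grows on each failed check before the thread finally yields. A rough sketch (the helper name and the cap of 1024 are arbitrary choices, not something from the discussion):

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <xmmintrin.h> // _mm_pause (x86/x86-64 only)

    // Spin with exponential backoff: double the pause count after every failed
    // check, and only give up the timeslice once the backoff saturates.
    inline void backoff_wait(const std::atomic<bool>& is_locked) {

        uint_fast32_t pause_count = 1;

        while (is_locked.load(std::memory_order_acquire)) {

            for (uint_fast32_t i = 0; i < pause_count; ++i) _mm_pause();

            if (pause_count < 1024) pause_count <<= 1;
            else std::this_thread::yield();
        }
    }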