
I have N threads performing various tasks, and these threads must be regularly synchronized with a thread barrier, as illustrated below with 3 threads and 8 tasks. The || indicates the temporal barrier: all threads have to wait until the completion of 8 tasks before starting again.

Thread#1  |----task1--|---task6---|---wait-----||-taskB--|          ...
Thread#2  |--task2--|---task5--|-------taskE---||----taskA--|       ...
Thread#3  |-task3-|---task4--|-taskG--|--wait--||-taskC-|---taskD   ...

I couldn't find a workable solution, though The Little Book of Semaphores (http://greenteapress.com/semaphores/index.html) was inspiring. I came up with the solution below, using three std::atomic variables, which "seems" to be working. I am worried about my code breaking down in corner cases, hence the quoted verb. So, can you share advice on how to verify such code? Do you have simpler, foolproof code available?

std::atomic<int> barrier1(0);
std::atomic<int> barrier2(0);
std::atomic<int> barrier3(0);

void my_thread()
{

  while(1) {
    // pop task from queue
    ...
    // and execute task 
    switch(task.id()) {
      case TaskID::Barrier:
        barrier2.store(0);
        barrier1++;
        while (barrier1.load() != NUM_THREAD) {
          std::this_thread::yield();
        }
        barrier3.store(0);
        barrier2++;
        while (barrier2.load() != NUM_THREAD) {
          std::this_thread::yield();
        }
        barrier1.store(0);
        barrier3++;
        while (barrier3.load() != NUM_THREAD) {
          std::this_thread::yield();
        }
        break;
      case TaskID::Task1:
        ...
    }
  }
}
user3636086
  • Since you know about semaphores, what's wrong with using a semaphore? – edgar.holleis Jun 13 '14 at 12:28
  • No native std::semaphore in C++11 standard library, so you have to use std::mutex and std::condition_variable (a sketch of such a semaphore follows these comments). – user3636086 Jun 13 '14 at 12:57
  • Does this help? http://stackoverflow.com/questions/8115267/writing-a-spinning-thread-barrier-using-c11-atomics?rq=1 I could close as a duplicate with a single vote, but I'm not sure if it's the same question. – R. Martinho Fernandes Jun 13 '14 at 13:00
  • @R.MartinhoFernandes The interesting question is whether we want a spinning solution or not. Typically, barriers are not implemented that way, since it is expected that some threads will have to wait for others to complete and we don't want to block cores with busy waiting during that time. A solution based on `condition_variable` that sends waiting threads to sleep seems more appropriate here. – ComicSansMS Jun 13 '14 at 14:33
  • You are right, that's the key question. My proposal is a spinning solution, so it may not be that efficient. I'll switch to the `condition_variable` solution found here: http://stackoverflow.com/questions/8115267/writing-a-spinning-thread-barrier-using-c11-atomics?rq=1 – user3636086 Jun 13 '14 at 15:08
  • Just use a semaphore. http://stackoverflow.com/a/4793662/412080 – Maxim Egorushkin Jun 16 '14 at 14:18
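
As a reference for the comments above, a minimal counting semaphore built only on std::mutex and std::condition_variable might look like this (a sketch; the class and member names are illustrative, not taken from the linked answers):

#include <condition_variable>
#include <mutex>

// Minimal counting semaphore for C++11, which has no std::semaphore.
class Semaphore {
    std::mutex mtx;
    std::condition_variable cv;
    int count;

public:
    explicit Semaphore(int initial = 0) : count(initial) {}

    void post() {                       // a.k.a. release / signal
        std::lock_guard<std::mutex> lk(mtx);
        ++count;
        cv.notify_one();
    }

    void wait() {                       // a.k.a. acquire
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }
};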

3 Answers


Boost offers a barrier implementation as an extension to the C++11 standard thread library. If using Boost is an option, you should look no further than that.

If you have to rely on standard library facilities, you can roll your own implementation based on std::mutex and std::condition_variable without too much of a hassle.

#include <condition_variable>
#include <mutex>

class Barrier {
    int wait_count;
    int const target_wait_count;
    std::mutex mtx;
    std::condition_variable cond_var;

public:
    explicit Barrier(int threads_to_wait_for)
     : wait_count(0), target_wait_count(threads_to_wait_for) {}

    void wait() {
        std::unique_lock<std::mutex> lk(mtx);
        ++wait_count;
        if(wait_count != target_wait_count) {
            // not all threads have arrived yet; go to sleep until they do
            cond_var.wait(lk, 
                [this]() { return wait_count == target_wait_count; });
        } else {
            // we are the last thread to arrive; wake the others and go on
            cond_var.notify_all();
        }
        // note that if you want to reuse the barrier, you will have to
        // reset wait_count to 0 now before calling wait again
        // if you do this, be aware that the reset must be synchronized with
        // threads that are still stuck in the wait
    }
};

This implementation has the advantage over your atomics-based solution that threads waiting in condition_variable::wait should get sent to sleep by your operating system's scheduler, so you don't block CPU cores by having waiting threads spin on the barrier.

A few words on resetting the barrier: The simplest solution is to just have a separate reset() method and have the user ensure that reset and wait are never invoked concurrently. But in many use cases, this is not easy to achieve for the user.
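
For illustration, such a reset() member of the Barrier class above might look like this (a sketch; it is only safe if the caller guarantees that no thread is inside wait() while reset() runs):

// Assumes the caller ensures reset() and wait() are never invoked concurrently.
void reset() {
    std::lock_guard<std::mutex> lk(mtx);
    wait_count = 0;
}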

For a self-resetting barrier, you have to consider races on the wait count: If the wait count is reset before the last thread has returned from wait, some threads might get stuck in the barrier. A clever solution here is to not have the terminating condition depend on the wait count variable itself. Instead, you introduce a second counter that is only increased by the thread calling the notify. The other threads then observe that counter for changes to determine whether to exit the wait:

void wait() {
    std::unique_lock<std::mutex> lk(mtx);
    unsigned int const current_wait_cycle = m_inter_wait_count;
    ++wait_count;
    if(wait_count != target_wait_count) {
        // wait condition must not depend on wait_count
        cond_var.wait(lk, 
            [this, current_wait_cycle]() { 
                return m_inter_wait_count != current_wait_cycle;
            });
    } else {
        // increasing the second counter allows waiting threads to exit
        ++m_inter_wait_count;
        cond_var.notify_all();
    }
}

This solution is correct under the (very reasonable) assumption that all threads leave the wait before the inter_wait_count overflows.
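
Note that m_inter_wait_count is an additional data member of the barrier class. For reference, a sketch of the members this variant needs, keeping the names from the code above:

class Barrier {
    unsigned int m_inter_wait_count;   // bumped once per completed barrier cycle
    int wait_count;
    int const target_wait_count;
    std::mutex mtx;
    std::condition_variable cond_var;

public:
    explicit Barrier(int threads_to_wait_for)
     : m_inter_wait_count(0), wait_count(0), target_wait_count(threads_to_wait_for) {}

    void wait();   // as shown above
};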

ComicSansMS
  • The boost Barrier is not reusable; the note included as a comment in the code flags the difficulty of resetting wait_count to 0. I ran into the same problem while writing my code, i.e. with only 2 atomic variables there is a deadlock problem. – user3636086 Jun 13 '14 at 16:34
  • @user3636086 The boost barrier _is_ reusable. It resets automatically once all the threads have arrived and you can just call wait again as often as you want. – ComicSansMS Jun 13 '14 at 16:36
  • @user3636086 I added a few words on how to implement the resetting. – ComicSansMS Jun 14 '14 at 09:07

With atomic variables, using three of them for a barrier is simply overkill that only serves to complicate the issue. You know the number of threads, so you can simply atomically increment a single counter every time a thread enters the barrier, and then spin until the counter becomes greater than or equal to N. Something like this:

#include <atomic>
#include <thread>

void barrier(int N) {
    static std::atomic<unsigned int> gCounter(0);
    gCounter++;
    // signed difference makes the check wrap-around safe when gCounter overflows
    while((int)(gCounter - N) < 0) std::this_thread::yield();
}

If you don't have more threads than CPU cores and the expected waiting time is short, you might want to remove the call to std::this_thread::yield(). This call is likely to be really expensive (more than a microsecond, I'd wager, but I haven't measured it). Depending on the size of your tasks, this may be significant.

If you want to do repeated barriers, just increase N as you go:

unsigned int lastBarrier = 0;   // target count of the most recent barrier
while(1) {
    switch(task.id()) {
        case TaskID::Barrier:
            // processCount is the number of participating threads
            barrier(lastBarrier += processCount);
            break;
    }
}
cmaster - reinstate monica
  • Simple and efficient. I've tried it in my code; it works OK, but I had to change while(gCounter < N) to while(gCounter != N) due to the nature of my thread pool. Thanks – user3636086 Jun 16 '14 at 13:28
  • I used `gCounter < N` on purpose: As far as I can see, your threads are not guaranteed to synchronize between barrier invocations. So one thread might run ahead while another one is still preempted inside the barrier, and call the next barrier before the late thread leaves the barrier. In that case, the counter may skip the `N` the late thread is waiting for, causing it to deadlock. I suppose your issue with `gCounter < N` was that you want the counter to wrap around? I have edited my answer to work under these circumstances: you need to use an unsigned integer type when you expect overflow. – cmaster - reinstate monica Jun 16 '14 at 14:15
  • Note that if your thread is a high-priority real-time FIFO thread on Linux, this code may a) hog the CPU, b) lead to priority inversion. Don't use this code. Just use `std::mutex` + `std::condition_variable`. – Maxim Egorushkin Jun 16 '14 at 14:23
  • @MaximYegorushkin I don't know what you are referring to with a) and b), but I have the impression that you have not read my answer completely: I specifically say that the `yield()` call should not be removed unless there is at most one thread per core. As long as the threads that have not reached the barrier make progress, this code won't lock. – cmaster - reinstate monica Jun 16 '14 at 14:33
  • @cmaster `yield()` would return immediately when called from a high-priority real-time FIFO thread because there may be no higher priority threads. – Maxim Egorushkin Jun 16 '14 at 14:34
  • @MaximYegorushkin Still, as long as you don't take more threads than you have cores, it can't deadlock. And you should never use more threads than you have cores anyway, at least not in a thread-pool that is used to dispatch tasks to CPUs. – cmaster - reinstate monica Jun 16 '14 at 14:41
  • @cmaster It won't deadlock in any case, but it might slow down the system considerably, even if the number of threads is equal to or slightly less than the number of physical cores. The problem with busy waiting is that you are basically tricking the system into thinking that you have _lots_ of work to be done. So you will get lots of resources in terms of both CPU time assigned by the scheduler and actual electrical power consumed by the CPU. If you decide to spend those resources on looping _really fast_ on a single variable, that's your decision to make. – ComicSansMS Jun 16 '14 at 14:57
  • @ComicSansMS Well, there is nothing that guarantees that this code will make forward progress either. Imagine running it on an 8-core box in a cgroup with only 2 CPUs assigned. Your code still detects 8 cores, but the scheduler won't give it more than 2 cores at a time... – Maxim Egorushkin Jun 16 '14 at 15:04
  • @MaximYegorushkin True, with a sufficiently unfair scheduler you might indeed deadlock. On most modern desktop OSes you will probably get away with a slowdown, but when moving to embedded hardware, schedulers might turn out to be less forgiving. – ComicSansMS Jun 16 '14 at 15:08
  • @ComicSansMS All of what you said is true. The only assumption I make is that the thread setup itself is sane, i.e. that each thread has a core pretty much to itself, and that the load is fairly balanced. If either one is not the case, you have much more pressing performance problems on your hands; the time you lose syncing will dwarf all other considerations. However, if both are true, then busy waiting is not such a bad idea. Take a look at the Linux people, they use spinlocks all over the place. Why? Because the average waiting time is much smaller than the overhead of scheduling. – cmaster - reinstate monica Jun 16 '14 at 15:25
  • @cmaster Right, under certain expectations spinning is the best solution. What is dangerous here is that with a barrier those expectations are seldom true. With a spinlock mutex you only need to wait if it is currently locked. Unless the contention is very high, this is unlikely; it is an exceptional case. With a barrier, however, you expect that all but one thread will have to wait for some time. Waiting is the normal case. What's worse, if just one thread is slow, all the others wait a lot. I agree that spinning can be right here, it's just that with a barrier most of the time it is not. – ComicSansMS Jun 16 '14 at 15:38
  • @cmaster In the Linux kernel, a spin-lock is often used in contexts where a thread cannot be blocked, e.g. in an interrupt handler. – Maxim Egorushkin Jun 16 '14 at 15:53

I would like to point out that in the solution given by @ComicSansMS, wait_count should be reset to 0 before executing cond_var.notify_all().

This is because, if wait_count is not reset to 0, the if condition will always fail when the barrier is called a second time.
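
For illustration, the wait() from that answer with the suggested reset added would look like this (a sketch, using the same member names):

void wait() {
    std::unique_lock<std::mutex> lk(mtx);
    unsigned int const current_wait_cycle = m_inter_wait_count;
    ++wait_count;
    if(wait_count != target_wait_count) {
        // wait condition must not depend on wait_count
        cond_var.wait(lk,
            [this, current_wait_cycle]() {
                return m_inter_wait_count != current_wait_cycle;
            });
    } else {
        // last thread to arrive: reset wait_count for the next cycle,
        // then release the waiting threads
        wait_count = 0;
        ++m_inter_wait_count;
        cond_var.notify_all();
    }
}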

Harsh Agarwal