
I have found that pthread_barrier_wait is quite slow, so in one place in my code I replaced it with my own version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in this approach? Is it correct? Also, I don't understand why it is faster than pthread_barrier_wait. Any clue?

EDIT

  • I am primarily interested in cases where the number of threads equals the number of cores.

    #include <atomic>     // for std::atomic
    #include <sched.h>    // for sched_yield()

    // NUM_OF_THREADS is defined elsewhere as the number of participating threads
    std::atomic<int> thread_count(0);

    void my_barrier()
    {
      thread_count++;

      // spin until every thread of the current round has arrived
      while( thread_count % NUM_OF_THREADS )
        sched_yield();
    }
    
MetallicPriest
  • The only difference I can see is that pthread_barrier_wait() will pick one thread to receive PTHREAD_BARRIER_SERIAL_THREAD while all the others get 0. If you don't need this functionality, your implementation looks equivalent to me. – kfmfe04 Jan 26 '12 at 14:39
  • @kfmfe04: Far from it. See my answer. Implementing a barrier that has all the properties required by the POSIX standard (many of which are useful a lot more often than you'd think) is far from easy. – R.. GitHub STOP HELPING ICE Jan 27 '12 at 04:24

3 Answers


Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:

  1. NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
  2. Last thread arrives and passes through the barrier.
  3. Last thread exits barrier, continues processing, finishes its next task, and reenters the barrier wait.
  4. Only now do the other waiting threads get scheduled, and they can't exit the barrier because the counter was incremented again. Deadlock.
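
For illustration, a minimal sketch of a reusable spin barrier that avoids this reuse race by "sense reversal": the last arriver resets the counter and flips a phase flag, and everyone else spins on the phase rather than on the counter, so a fast thread re-entering the barrier cannot strand the previous round's waiters. This assumes C++11 atomics, a fixed set of participating threads, and a placeholder constant NUM_OF_THREADS:

    #include <atomic>
    #include <sched.h>

    constexpr int NUM_OF_THREADS = 8;          // placeholder value

    std::atomic<int>  arrived(0);
    std::atomic<bool> phase(false);

    void my_barrier()
    {
        bool my_phase = phase;                             // phase of this round
        if (arrived.fetch_add(1) == NUM_OF_THREADS - 1) {
            arrived = 0;                                   // reset for the next round
            phase = !my_phase;                             // release this round's waiters
        } else {
            while (phase == my_phase)                      // wait on the phase, not the count
                sched_yield();
        }
    }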

In addition, one often-overlooked but nasty issue with dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching the memory in the barrier object before the wait returns in any of them - not an easy problem to solve. See my past questions on implementing barriers...

How can barriers be destroyable as soon as pthread_barrier_wait returns?

Can a correct fail-safe process-shared barrier be implemented on Linux?
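
For context, the usage pattern the first linked question revolves around looks roughly like the sketch below (hedged: the struct and function names are made up for illustration). Exactly one waiter receives PTHREAD_BARRIER_SERIAL_THREAD and is elected to destroy and free the barrier right after its wait returns, which is only safe if the implementation guarantees that no other waiter is still touching the barrier's memory at that point.

    #include <pthread.h>

    struct shared_state {
        pthread_barrier_t bar;   // initialized elsewhere with pthread_barrier_init()
        // ... data used in the final phase of work ...
    };

    // Hypothetical last use: after this wait, no thread will wait on the barrier again.
    void final_phase(shared_state *s)
    {
        if (pthread_barrier_wait(&s->bar) == PTHREAD_BARRIER_SERIAL_THREAD) {
            pthread_barrier_destroy(&s->bar);   // one elected thread tears it down
            delete s;
        }
        // The other threads must not touch *s once their wait has returned.
    }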

And unless you know you have a special case where none of the difficult problems apply, don't try implementing your own barrier for an application.

R.. GitHub STOP HELPING ICE
  • I actually came up with it because in my program I had a piece of code sandwiched between two pthread_barrier_waits, so I replaced the first one with my_barrier. What do you think, is it valid for such use, or not valid there either? – MetallicPriest Jan 27 '12 at 17:13
  • Yours will probably work if the barrier is only ever used once, and never destroyed/never destroyable. But it's going to be damn slow when a large number of threads go into a loop spinning on the barrier counter rather than sleeping... – R.. GitHub STOP HELPING ICE Jan 28 '12 at 01:29

AFAICT it's correct, and it looks like it's faster, but in the highly contended case it will be a lot worse. The highly contended case is when you have lots of threads, way more than CPUs.

There's a way to make fast barriers though, using eventcounts (search for them on Google).

struct barrier {
    atomic<int>       count;   /* initialized to the number of participating threads */
    struct eventcount ec;
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        /* last arriver: wake every thread sleeping on the eventcount */
        eventcount_broadcast(&b->ec);
        return;
    }
    for (;;) {
        /* grab a key, re-check the condition, then sleep on that key
           (the exact eventcount API varies between implementations) */
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        eventcount_wait(&b->ec, key);
    }
}

This should scale way better.
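
Since eventcount libraries aren't standardized, here is, for comparison, a hedged sketch of the same "sleep instead of spin" idea built only from standard C++11 primitives (a mutex plus condition variable rather than an eventcount, so it lacks the lock-free fast path of the code above; the class name is just for illustration):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    // Reusable blocking barrier: waiters sleep on a condition variable, and a
    // generation counter both guards against spurious wakeups and makes the
    // barrier safe to reuse.
    class blocking_barrier {
    public:
        explicit blocking_barrier(std::size_t n) : threshold_(n), count_(n) {}

        void wait()
        {
            std::unique_lock<std::mutex> lk(m_);
            std::size_t gen = generation_;
            if (--count_ == 0) {
                ++generation_;          // start a new round
                count_ = threshold_;    // reset the counter for reuse
                cv_.notify_all();       // wake every sleeping waiter
            } else {
                cv_.wait(lk, [&] { return gen != generation_; });
            }
        }

    private:
        std::mutex              m_;
        std::condition_variable cv_;
        const std::size_t       threshold_;
        std::size_t             count_;
        std::size_t             generation_ = 0;
    };

Whether this beats a futex-based pthread barrier depends on the platform; the point is only to show blocking rather than burning CPU in a yield loop.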

Though frankly, when you use barriers, I don't think performance matters much; it's not supposed to be an operation that needs to be fast. It looks a lot like premature optimization.

Pierre Habouzit
  • Well, in my system the number of threads equals the number of cores, so contention is not an issue. Secondly, if the algorithm you have mentioned is really that fast, why doesn't pthread_barrier_wait use it? Lastly, I don't think it's premature optimization, as some of the benchmarks I'm working with execute thousands of barriers per second! – MetallicPriest Jan 26 '12 at 14:37
  • It shouldn't be that bad with lots of threads, since the waiting threads will sleep most of the time (due to the yield), only spuriously waking up to check whether they can continue executing. So they might wake up a bit more often, but it shouldn't scale too badly, I think (might be wrong though). – Grizzly Jan 26 '12 at 15:13
  • Eventcounts aren't that well known. The idea, though, is that with luck the last threads doing the count-- won't have to wait on a futex or similar. – Pierre Habouzit Jan 26 '12 at 15:29

Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread count is a power of two. Theoretically your atomic will overflow somewhere (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.

Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep until it is time to wake up. Yours spins on the condition, yielding in each iteration. However, if there is no other application/thread that needs the processing time, the thread will likely be scheduled again right after the yield, so the wait time is shorter. At least that's what playing around with this kind of barrier seemed to indicate on my system.

As a side note: since you use atomic<int> I assume you are using C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case, to remove the dependency on pthreads?
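
Putting those two points together (the possible overflow and the std::this_thread::yield() suggestion), a hedged sketch might look like this. NUM_OF_THREADS is a placeholder constant, and the reuse limitation from the accepted answer still applies:

    #include <atomic>
    #include <thread>

    constexpr unsigned NUM_OF_THREADS = 8;        // placeholder value

    std::atomic<unsigned> thread_count(0);

    void my_barrier()
    {
        ++thread_count;
        // Unsigned wraparound is well defined, and when NUM_OF_THREADS is a
        // power of two, (thread_count % NUM_OF_THREADS) stays consistent across
        // the wrap, so no explicit reset is needed in that case.
        while (thread_count % NUM_OF_THREADS)
            std::this_thread::yield();            // standard C++, no pthread dependency
    }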

This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i<NCPU) case, except for the yielding).

Grizzly