I have found that pthread_barrier_wait is quite slow, so in one place in my code I replaced it with my own barrier (my_barrier), which uses an atomic variable. It turned out to be much faster than pthread_barrier_wait. Is there any flaw in this approach? Is it correct? Also, why is it faster than pthread_barrier_wait? Any clue?
EDIT
I am primarily interested in cases where the number of threads equals the number of cores.
    atomic<int> thread_count = 0;

    void my_barrier()
    {
        thread_count++;
        while( thread_count % NUM_OF_THREADS )
            sched_yield();
    }
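For comparison, here is a sketch of a variant I would consider if the barrier is called repeatedly. In the version above, a fast thread can re-enter the barrier and increment thread_count before a slow thread has seen the count reach a multiple of NUM_OF_THREADS, and the slow thread may then wait forever. The variant below keeps a separate generation counter so that cannot happen. The names GenerationBarrier and count_barrier_violations, and the value NUM_OF_THREADS = 4, are my own assumptions for illustration, not part of my real code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#include <sched.h>   // sched_yield (POSIX)

constexpr int NUM_OF_THREADS = 4;  // assumed value, for illustration only

// Hypothetical reusable spin barrier: waiters watch a generation counter
// instead of the arrival count, so an early re-entry cannot strand them.
class GenerationBarrier {
public:
    explicit GenerationBarrier(int n) : n_(n) { assert(n > 0); }

    void wait() {
        // Remember which round we are in before announcing arrival.
        const unsigned gen = generation_.load();
        if (arrived_.fetch_add(1) + 1 == n_) {
            // Last arrival: reset the count first, then open the barrier.
            arrived_.store(0);
            generation_.fetch_add(1);
        } else {
            // Everyone else yields until the generation changes.
            while (generation_.load() == gen)
                sched_yield();
        }
    }

private:
    const int n_;
    std::atomic<int> arrived_{0};
    std::atomic<unsigned> generation_{0};
};

// Hypothetical check: runs `rounds` barrier rounds on NUM_OF_THREADS threads
// and counts how often a thread passed the barrier before all threads of
// that round had arrived.
int count_barrier_violations(int rounds) {
    GenerationBarrier barrier(NUM_OF_THREADS);
    std::atomic<int> progress{0};
    std::atomic<int> violations{0};

    auto worker = [&] {
        for (int r = 0; r < rounds; ++r) {
            progress.fetch_add(1);   // work done before the barrier
            barrier.wait();
            // After round r, every thread must have incremented r + 1 times.
            if (progress.load() < (r + 1) * NUM_OF_THREADS)
                violations.fetch_add(1);
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_OF_THREADS; ++i)
        threads.emplace_back(worker);
    for (auto& t : threads)
        t.join();
    return violations.load();
}
```

Calling count_barrier_violations(1000) from main should return 0 if the barrier holds. Like my_barrier, this still spins with sched_yield, so I would only expect it to behave well when there are no more threads than cores.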