
I am new to using condition variables, so I could easily be doing something stupid here, but I am seeing some odd performance when I use boost threads versus just calling the function directly. If I change the line that creates a boost thread on func to a direct call to func, the code runs several orders of magnitude faster. I have also tried the boost threadpool software from SourceForge, and it makes no difference...

Here is the code:

#include <boost/thread.hpp>


using namespace boost;

condition_variable cond;
mutex conditionalMutex;
int numThreadsCompleted = 0;
int numActiveThreads = 0;

void func()
{
  {
    lock_guard<mutex> lock(conditionalMutex);
    --numActiveThreads;
    numThreadsCompleted++;
  }
  cond.notify_one();
}


int main()
{
  int i=0;
  while (i < 100000)
    {
      if (numActiveThreads == 0)
        {
          ++numActiveThreads;
          thread thd(func);
          //Replace above with a direct call to func for several orders of magnitude
          //performance increase...
          ++i;
        }
      else
        {
          unique_lock<mutex> lock(conditionalMutex);
          while (numThreadsCompleted == 0)
            {
              cond.wait(lock);
            }
          numThreadsCompleted--;
        }
    }
  return 0;
}

3 Answers


The performance is bound to be much worse than calling the function directly. You start one thread, and then wait for that thread to end. Even if you could reduce the overhead of starting a thread to zero, you still communicate with that thread, so you will have at least one context switch, and since your func() is basically doing nothing, that overhead becomes the dominant factor. Add some more payload into func() and the ratio will change; a sketch of that follows. If the work to be done is very small, just do it on the thread that found it.
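For instance, here is a drop-in replacement for the question's func() with an artificial payload in front, reusing the question's globals; the busy loop and its bound are assumptions, picked only so the work dwarfs the thread-creation cost:

// Same func() as in the question, but with a dummy payload.
// The loop bound of 1,000,000 is arbitrary.
void func()
{
  volatile double x = 0.0;        // volatile keeps the loop from being optimized away
  for (int k = 0; k < 1000000; ++k)
    x += k * 0.5;                 // stand-in for real work

  {
    lock_guard<mutex> lock(conditionalMutex);
    --numActiveThreads;
    numThreadsCompleted++;
  }
  cond.notify_one();
}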

BTW: you have a race condition, because main() reads and writes numActiveThreads without holding the mutex. The code above boils down to:

int main()
{
    int i=0;
    while (i < 100000)
    {
        thread thd(func);
        thd.join();
        ++i;
    }

    return 0;
}

and there is really no reason why this should be faster than:

int main()
{
    int i=0;
    while (i < 100000)
    {
        func();
        ++i;
    }

    return 0;
}
Torsten Robitzki

You're creating and destroying threads, which are usually implemented on top of a lower-level OS construct such as a lightweight process. That creation and destruction can be costly.

In effect, your loop is doing

  1. Create Thread
  2. Wait for Thread to exit

over and over again, so you pay the creation and destruction cost on every single iteration, and those costs add up. One way to avoid them is sketched below.
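If the work really does belong on another thread, a common pattern is to create a single long-lived worker once and hand it tasks through a condition variable, so the creation/destruction cost is paid only once. The following is only a sketch of that idea, not the asker's code; all names (workerLoop, haveWork, and so on) are invented for illustration:

#include <boost/thread.hpp>

// Hypothetical single-worker setup: one thread is created once and
// reused for every task, so the per-iteration cost is only the
// mutex/condition-variable handshake.
boost::mutex m;
boost::condition_variable cv;
bool haveWork = false;
bool workDone = false;
bool quit = false;

void workerLoop()
{
  boost::unique_lock<boost::mutex> lock(m);
  while (!quit)
  {
    while (!haveWork && !quit)
      cv.wait(lock);            // sleep until a task (or shutdown) arrives
    if (quit)
      break;
    haveWork = false;
    // ... the real payload would run here ...
    workDone = true;
    cv.notify_one();            // wake the producer waiting for completion
  }
}

int main()
{
  boost::thread worker(workerLoop);   // created exactly once
  for (int i = 0; i < 100000; ++i)
  {
    boost::unique_lock<boost::mutex> lock(m);
    haveWork = true;
    workDone = false;
    cv.notify_one();
    while (!workDone)
      cv.wait(lock);            // wait for the worker to finish this task
  }
  {
    boost::unique_lock<boost::mutex> lock(m);
    quit = true;
    cv.notify_one();
  }
  worker.join();
  return 0;
}

A real implementation would hand over an actual task object rather than a bare flag, which is essentially what a thread pool does.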

Dave S
  • I thought the overhead of creating and destroying threads would be the issue as well, so I tried thread pools: 34 seconds with plain boost::thread, 7 seconds with a thread pool, and 0.16 seconds with a direct function call. This is after increasing the loop length to give larger numbers... – Ronald Van Iwaarden Aug 16 '12 at 02:17

In addition to the overhead from creating and destroying the thread, branch prediction may be contributing to the difference in performance.

Without threading, the condition of the if-statement is always true, as numActiveThreads will be 0 at the start and end of each loop iteration:

while (i < 100000)
{
  if (numActiveThreads == 0) // branch always taken
  {
    ++numActiveThreads; // numActiveThreads = 1
    func();             // when this returns, numActiveThreads = 0
    ++i;                
  }
}

This results in:

  • Branch prediction never failing.
  • No overhead for thread creation/destruction.
  • No time spent blocked waiting to acquire conditionalMutex.

With threading, numActiveThreads may or may not be 0 in consecutive iterations. On most machines I tested, short predictable patterns were observed, with execution alternating between the if-branch and the else-branch on each iteration. Sometimes, however, the if-branch is taken in consecutive iterations. Thus, time may be wasted on the following; a timing sketch after the list shows one way to measure the overall effect:

  • Branch prediction failure.
  • The creation and destruction of threads. If the creation and destruction are concurrent, then synchronization may be occurring in the underlying thread library.
  • Blocking while waiting to acquire conditionalMutex or waiting on cond.
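To get a feel for how large these effects are in absolute terms, a rough wall-clock comparison can help. This is only a sketch using Boost's posix_time clock, with an arbitrary iteration count; it times the join()-based reduction from the first answer against a direct call:

#include <boost/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

void func() { /* trivial payload, as in the question */ }

int main()
{
  using namespace boost::posix_time;
  const int N = 10000;            // arbitrary; large enough to measure

  ptime t0 = microsec_clock::local_time();
  for (int i = 0; i < N; ++i)
  {
    boost::thread thd(func);      // one thread per iteration
    thd.join();
  }
  ptime t1 = microsec_clock::local_time();
  for (int i = 0; i < N; ++i)
    func();                       // direct call
  ptime t2 = microsec_clock::local_time();

  std::cout << "thread-per-call: " << (t1 - t0).total_milliseconds() << " ms\n"
            << "direct call:     " << (t2 - t1).total_milliseconds() << " ms\n";
  return 0;
}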
Tanner Sansbury
  • Theoretically, branch misprediction should not be a major factor in the slowdown: a single mispredicted branch typically costs less than a hundred cycles, while creating and destroying a thread can consume far more CPU time. – WiSaGaN Aug 17 '12 at 10:43