I have an embarrassingly parallel problem that I want to execute on multiple processors. I had supposed that boost::thread would automatically send new threads to new processors, but all of them are executing on the same core as the parent process. Is it possible to get each thread to run on a different processor, or do I need something like MPI?

My suspicion is that boost::thread is simply not a multi-processor tool, that I'm asking it to do something it's not designed for.

EDIT: my question boils down to this: Why do all the threads execute on one processor? Is there a way to get boost::thread to send threads to different processors?

Here's the relevant sample of my code:

size_t lim=1000;
std::deque<int> vals(lim);
std::deque<boost::thread *> threads;
int i=0;
std::deque<int>::iterator it = vals.begin();
for (; it!=vals.end(); ++it, ++i) {
  threads.push_back(new boost::thread(doWork, it, i));
  // once maxConcurrentThreads threads exist, join and discard the oldest
  while (threads.size() >= maxConcurrentThreads) {
    threads.front()->join();
    delete threads.front();
    threads.pop_front();
  }
}
// wait for the threads that are still outstanding
while(threads.size()) {
  threads.front()->join();
  delete threads.front();
  threads.pop_front();
}

As should be clear, doWork does some calculation using the parameter i and stores the result in vals. My idea was to set maxConcurrentThreads equal to the number of cores available, so that each new thread would run on whichever core was idle. I just need someone to confirm that boost::thread cannot be made to work in this way.

(I'd guess that there's a better way to limit the number of concurrent threads than using a queue; feel free to scold me for that as well.)


Here's the doWork function:

void doWork(std::deque<int>::iterator it, int i) {
  int ret=0;
  int size = 1000; // originally 1000, later changed to 10,000,000
  for (int j=i; j<i+size; j++) {
    ret+=j;
  }
  *it=ret;
  return;
}

EDIT: As Martin James suggested, the problem was that the doWork function was initially only 1000 int additions. With such a small job, creating and scheduling a thread took longer than running it, so only one processor was ever in use. Making the job longer (adding 10,000,000 ints) yielded the desired behavior. The point being: boost::thread will use multiple cores by default, but if each thread does less work than it costs to create and schedule that thread, you won't see any benefit from multithreading.

Thanks to everyone for aiding my understanding in this.

Shep
flies
  • Right, multiple threading and multiprocessing are quite different concepts, and boost::thread supports the former. – juanchopanza Apr 25 '12 at 16:06
  • Sounds like MPI to me... welcome to my world ! – Scottymac Apr 25 '12 at 16:21
  • I don't think that has anything to do with MPI, he is only mixing the words multiprocessor and multicore system. – Stephan Dollberg Apr 25 '12 at 16:25
  • @juanchopanza I understand you to be saying that `boost::thread` cannot be made to send each thread to a different core. Is that right? – flies Apr 25 '12 at 16:37
  • I would be surprised if DoWork() took any longer than a couple of µs. This is much, much less than the overhead of creating the thread in the first place. Most of the work is being done by the one thread that is doing the continual create, join, wake up again almost immediately, loop... – Martin James Apr 25 '12 at 17:02
  • No, that isn't what I meant. Different cores are fine, different processors not so. – juanchopanza Apr 25 '12 at 17:04
  • I may have to try this :(( What do you have 'maxConcurrentThreads' set to currently? – Martin James Apr 25 '12 at 17:10
  • @juanchopanza ah. Clearly, I'm no expert. Hence my question :) – flies Apr 25 '12 at 17:11
  • @MartinJames it's set to 6. I added a statement to check the size of the `threads` queue after the loop and it says it's got 5 elements, as expected. – flies Apr 25 '12 at 17:12
  • have you checked the number of cores you see with `boost::thread::hardware_concurrency()`? – juanchopanza Apr 25 '12 at 17:22
  • @juanchopanza: I didn't, but I printed it out and it says 8, which is the number of cores on my machine. – flies Apr 25 '12 at 17:26
  • I was looking at `top` and it only showed one process, but I failed to notice that that process was using 600% CPU. – flies Apr 25 '12 at 18:44
  • Usually, you can tell if something like this is working by listening. If I load up my box with 8 100% CPU threads, the CPU fan revs up within a couple of seconds. – Martin James Apr 25 '12 at 18:48
  • @MartinJames Heh. I'll have to remember that. (Of course, the server is separated from me by a few buildings, some concrete, and my over-ear headphones, but I bet if I listen real close...) – flies Apr 25 '12 at 19:02

1 Answer

You are always joining the first thread in the queue. If this thread is taking a long time it might be the only thread left. I guess what you want is to start a new thread once any thread has completed.

I don't know why you get an effective concurrency level of only one, though.

After having looked at the doWork function, I think it does so little work that running it costs less than starting a thread in the first place. Try running it with more work (1000x).

usr
  • did you mix `deque` with `queue`? – Stephan Dollberg Apr 25 '12 at 16:17
  • The code starts to join only if `threads.size() >= maxConcurrentThreads`. – megabyte1024 Apr 25 '12 at 16:21
  • @megabyte1024 that doesn't matter because if the first thread in the deque takes much longer than the others, all the others will finish before the first one and the only one running at a time is the first one. – Stephan Dollberg Apr 25 '12 at 16:22
  • Actually, my recommendation is to use a thread-pool. It will handle all of this for you. http://stackoverflow.com/questions/4084777/creating-a-thread-pool-using-boost – usr Apr 25 '12 at 16:26
  • @bamboon In fact it does not matter. – megabyte1024 Apr 25 '12 at 16:31
  • @megabyte1024 if the joined thread happens to execute a sleep(infinity) new threads will never be spawned. The concurrency level will erroneously fall to one. – usr Apr 25 '12 at 16:32
  • @megabyte1024 If you are talking about the mixing of `queue` and `deque` then you are right in this case, however I mix them all the time, too. The names are just too similar. – Stephan Dollberg Apr 25 '12 at 16:35
  • Each execution of `doWork` will take very nearly the same amount of time. (In my test code, it's just adding 1000 numbers together. In actual practice, it'll be executing a linear-time algorithm.) So there's no reason to expect that this issue is causing the threads to all execute on one processor. A thread pool is clearly superior to the queue implementation here, and I will look into it, but that's not addressing the central problem: _Why are all threads on one core? How can I send threads to different cores?_ – flies Apr 25 '12 at 16:36
  • You are right. Can you post doWork in its entirety? Btw, you need to delete the completed thread objects (does not matter here). – usr Apr 25 '12 at 16:42
  • Adding 1000 numbers together? That thread is probably done by the time you get to creating the second thread, so the second thread may well run on the same core because that core has the process context already set. Do some heavier work! – Martin James Apr 25 '12 at 16:42
  • Adding up all the numbers to 10,000,000 is better loading - you may then have a chance to see what is going on. – Martin James Apr 25 '12 at 16:49
  • I just looked at the code. I agree with @MartinJames. I added to my answer. – usr Apr 25 '12 at 17:04
  • changing it to 10,000,000 additions yields the same (single processor) behavior. – flies Apr 25 '12 at 17:05
  • OK, now I'm puzzled. OK, OP code is not the usual way of distributing work to threads, but it should work. – Martin James Apr 25 '12 at 17:09