I'm only asking this to try to understand what I've spent 24 hours trying to fix.
My system: Ubuntu 12.04.2, Matlab R2011a, both of them 64-bit, Intel Xeon processor based on Nehalem.
The problem is simple: MATLAB lets OpenMP-based programs use all CPU cores when hyper-threading is enabled, but does not allow the same for TBB.
When running TBB, I can launch only 4 threads, even when I change maxNumCompThreads to 8, while with OpenMP I can use as many threads as I want. Without hyper-threading, both TBB and OpenMP use all 4 cores, of course.
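One way to confirm the OpenMP side of this observation from inside the MEX file is to ask the runtime how many threads it actually puts in a team. The following is only a minimal sketch: the helper name `openmp_team_size` is hypothetical (not part of the code in this question), and the `_OPENMP` guard is there so the file still builds when OpenMP is not enabled at compile time.

```cpp
#ifdef _OPENMP
#include <omp.h>   // omp_get_num_threads
#endif

// Hypothetical helper: returns the number of threads in an OpenMP
// parallel region, or 1 when compiled without OpenMP support.
int openmp_team_size() {
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread writes the result; the implicit barrier at the
        // end of "single" makes the value visible after the region.
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

Calling this from mexFunction and printing the result with mexPrintf would show whether the team size really is 8 with hyper-threading on.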
I understand hyper-threading and that the extra cores are virtual, but the limit MATLAB imposes does cause a performance penalty (an extra reference).
I tested this issue using 2 programs, a simple for loop with
#pragma omp parallel for
and another very simple loop based on a tbb sample code.
tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
and wrapped both of them with a matlab mexFunction.
Does anyone have an explanation for this? Is there an inherent difference in how the threads are created or structured that allows MATLAB to throttle TBB but not OpenMP?
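One thing that might be worth checking is whether the MATLAB process runs with a restricted CPU affinity mask. This is an assumption on my part, not something confirmed here. A runtime that sizes its thread pool from the affinity mask would see fewer CPUs than one that counts the online processors. The sketch below is Linux/glibc only, and the helper names `online_cpus` and `allowed_cpus` are hypothetical; it simply compares the two numbers.

```cpp
#include <sched.h>    // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/glibc)
#include <unistd.h>   // sysconf

// Hypothetical helper: logical CPUs currently online in the machine,
// regardless of any affinity restriction on this process.
long online_cpus() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

// Hypothetical helper: logical CPUs the calling process is allowed
// to run on according to its affinity mask, or -1 on error.
int allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}
```

If both helpers are called from inside a mexFunction and `allowed_cpus()` reports 4 while `online_cpus()` reports 8, that would point at the affinity mask rather than at the threading libraries themselves.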
Code for reference:
OpenMP:
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int threadCount = 100000; // Number of outer-loop iterations to distribute
    #pragma omp parallel for
    for (int globalId = 0; globalId < threadCount; globalId++)
    {
        for (long i = 0; i < 1000000000L; ++i) {} // Deliberately run slow
    }
}
TBB:
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
    mytask(size_t n)
        : _n(n)
    {}
    void operator()() {
        for (long i = 0; i < 1000000000L; ++i) {} // Deliberately run slow
        std::cerr << "[" << _n << "]";
    }
    size_t _n;
};
template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    // Deferred: no explicit initialization here; TBB auto-initializes
    // with its default number of threads on first use
    tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
    std::vector<mytask> tasks;
    for (int i = 0; i < 10000; ++i)
        tasks.push_back(mytask(i));
    tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());
}