TBB acting strange in Matlab Mex file

Question

Edit: this question is not a duplicate of "Matlab limits TBB but not OpenMP", even though it uses the same sample code for illustration. In my case I pass an explicit number of threads to the TBB initialization instead of using "deferred", and I am asking about the difference in behavior between TBB in plain C++ and TBB in a MEX file. The answer to that question only demonstrates thread initialization when running TBB in C++, not in MEX.

I'm trying to speed up a MATLAB MEX file. The strange thing I came across is that TBB initialization doesn't work as expected inside MEX.

This C++ program reaches 100% CPU usage and runs 15 TBB threads when executed on its own:

main.cpp

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"

struct mytask {
  mytask(size_t n)
    :_n(n)
  {}
  void operator()() {
    for (long long i=0;i<10000000000LL;++i) {}  // Deliberately run slow (long is 32-bit on Windows)
    std::cerr << "[" << _n << "]";
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};

void mexFunction(/* int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] */) {

  tbb::task_scheduler_init init(15);  // 15 threads

  std::vector<mytask> tasks;
  for (int i=0;i<10000;++i)
    tasks.push_back(mytask(i));

  tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

}

int main()
{
    mexFunction();
}

Then I modified the code a little bit to make a MEX for matlab:

BuildMEX.mexw64

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"

struct mytask {
  mytask(size_t n)
    :_n(n)
  {}
  void operator()() {
    for (long long i=0;i<10000000000LL;++i) {}  // Deliberately run slow (long is 32-bit on Windows)
    std::cerr << "[" << _n << "]";
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};


void mexFunction( int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] ) {

  tbb::task_scheduler_init init(15);  // 15 threads

  std::vector<mytask> tasks;
  for (int i=0;i<10000;++i)
    tasks.push_back(mytask(i));

  tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

}

Finally, I invoke BuildMEX.mexw64 from MATLAB. I compiled the following snippet with mcc into a standalone MATLAB binary, "MEXtest.exe", and profiled it with VTune while it ran in the MCR. TBB initialized only 4 threads within the process, and the binary used only ~50% of the CPU. Why does MEX degrade TBB and overall performance? How can I get the MEX file to use more of the CPU?

MEXtest.exe

function MEXtest()

BuildMEX();

end

I'm not entirely clear here; are you using the [MATLAB Compiler](http://www.mathworks.com/products/compiler/) (`mcc`) to generate a standalone program that calls into the MEX-function (thus runs in the context of the [MCR](http://www.mathworks.com/products/compiler/mcr/))? Or are you comparing the MEX-function against a regular C++ program compiled externally and not tied to MATLAB whatsoever? The answer given below seems to suggest the former case. — Amro, Jun 18 '14 at 13:09
The former. I used mcc to generate a standalone program, which calls MEX function, and runs in MCR. — yfeng, Jun 18 '14 at 15:37

score 2 · Accepted Answer · answered Jun 18 '14 at 23:33

According to the scheduler class description:

This class allows to customize properties of the TBB task pool to some extent. For example it can limit concurrency level of parallel work initiated by the given thread. It also can be used to specify stack size of the TBB worker threads, though this setting is not effective if the thread pool has already been created.

This is further explained in the initialize() methods called by the constructor:

The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple task_scheduler_inits. Doing so does no harm because the underlying scheduler is reference counted.

(highlighted parts added by me)

I believe that MATLAB already uses Intel TBB internally, and it must have initialized a thread pool at a top level before the MEX-function is ever executed. Thus all task schedulers in your code are going to use the number of threads specified by internal parts of MATLAB, ignoring the value you specified in your code.
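To make the "first initializer wins" rule concrete, here is a minimal toy model in plain C++. It is not TBB code: `ToyPool`, `ToySchedulerInit` and `observed_pool_size` are hypothetical names invented for this sketch, and it is single-threaded on purpose. It only mimics the documented behavior that a requested size is ignored while another initializer is still alive.

```cpp
#include <cassert>

// Toy model only, NOT the TBB implementation. It mimics the documented rule
// that the first live initializer fixes the pool size and later requests
// are ignored while any initializer still exists. Not thread-safe.
class ToyPool {
public:
    // Register one more initializer; only the first one sets the size.
    static void acquire(int requested_threads) {
        assert(requested_threads > 0);
        if (ref_count_ == 0) {
            size_ = requested_threads;
        }
        ++ref_count_;
    }
    // Drop one initializer; the pool is torn down when none remain.
    static void release() {
        assert(ref_count_ > 0);
        if (--ref_count_ == 0) {
            size_ = 0;
        }
    }
    static int size() { return size_; }
private:
    static int ref_count_;
    static int size_;
};

int ToyPool::ref_count_ = 0;
int ToyPool::size_ = 0;

// RAII handle playing the role of tbb::task_scheduler_init in this model.
class ToySchedulerInit {
public:
    explicit ToySchedulerInit(int n) { ToyPool::acquire(n); }
    ~ToySchedulerInit() { ToyPool::release(); }
    ToySchedulerInit(const ToySchedulerInit&) = delete;
    ToySchedulerInit& operator=(const ToySchedulerInit&) = delete;
};

// Pool size seen by a "MEX" initializer asking for mex_request threads
// while a "host" initializer asking for host_request is still alive.
int observed_pool_size(int host_request, int mex_request) {
    ToySchedulerInit host(host_request);
    ToySchedulerInit mex(mex_request);
    return ToyPool::size();
}
```

In this model, `observed_pool_size(4, 15)` yields 4: the second request for 15 threads has no effect because the host's initializer already exists, which is the situation suspected inside MATLAB.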

By default MATLAB must have initialized the thread pool with a size equal to the number of physical cores (not logical ones), as suggested by the fact that on my quad-core hyper-threaded machine I get:

>> maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release [...]
ans =
     4

OpenMP, on the other hand, has no such scheduler, and we can control the number of threads at runtime by calling the following functions:

#include <omp.h>
.. 
omp_set_dynamic(1);
omp_set_num_threads(omp_get_num_procs());

or by setting the environment variable:

>> setenv('OMP_NUM_THREADS', '8')
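As a rough way to check what OpenMP actually delivers, the sketch below requests a team size and reports how many threads a parallel region really got. The function name `observed_team_size` is hypothetical, and unlike the snippet above it calls `omp_set_dynamic(0)` so that the runtime is asked not to shrink the team; the runtime may still grant fewer threads than requested. When the file is compiled without OpenMP support, it falls back to reporting 1.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the team size observed inside a parallel region after requesting
// `requested` threads. Falls back to 1 when compiled without OpenMP.
int observed_team_size(int requested) {
    assert(requested > 0);
    int team = 1;
#ifdef _OPENMP
    omp_set_dynamic(0);               // ask the runtime not to shrink the team
    omp_set_num_threads(requested);   // applies to subsequent parallel regions
    #pragma omp parallel
    {
        // Exactly one thread records the size of the current team.
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    return team;
}
```

Calling this from a MEX-function and from a standalone program with the same argument would show whether OpenMP, unlike TBB, honors the request in both environments.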

To test this proposed explanation, here is the code I used:

test_tbb.cpp

#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif

#include <cstdlib>
#include <cstdio>
#include <cmath>    // sin
#include <vector>

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include "tbb/spin_mutex.h"

#include "tbb_helpers.hxx"

#define NTASKS 100
#define NLOOPS 400000L

tbb::spin_mutex print_mutex;

struct mytask {
    mytask(size_t n) :_n(n) {}
    void operator()()
    {
        // track maximum number of parallel workers run
        ConcurrencyProfiler prof;

        // burn some CPU cycles!
        double x = 1.0 / _n;
        for (long i=0; i<NLOOPS; ++i) {
            x = sin(x) * 10.0;
            while((double) rand() / RAND_MAX < 0.9);
        }
        {
            tbb::spin_mutex::scoped_lock s(print_mutex);
            fprintf(stderr, "%f\n", x);
        }
    }
    size_t _n;
};

template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};

void run()
{
    // use all 8 logical cores
    SetProcessAffinityMask(GetCurrentProcess(), 0xFF);

    printf("numTasks = %d\n", NTASKS);
    for (int t = tbb::task_scheduler_init::automatic;
         t <= 512; t = (t>0) ? t*2 : 1)
    {
        tbb::task_scheduler_init init(t);

        std::vector<mytask> tasks;
        for (int i=0; i<NTASKS; ++i) {
            tasks.push_back(mytask(i));
        }

        ConcurrencyProfiler::Reset();
        tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());

        printf("pool_init(%d) -> %d worker threads\n", t,
            static_cast<int>(ConcurrencyProfiler::GetMaxNumThreads()));
    }
}

#ifdef MATLAB_MEX_FILE
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    run();
}
#else
int main()
{
    run();
    return 0;
}
#endif

Here is the code for a simple helper class used to profile concurrency by keeping track of how many workers were invoked from the thread pool. You could always use Intel VTune or any other profiling tool to get the same kind of information:

tbb_helpers.hxx

#ifndef HELPERS_H
#define HELPERS_H

#include "tbb/atomic.h"

class ConcurrencyProfiler
{
public:
    ConcurrencyProfiler();
    ~ConcurrencyProfiler();
    static void Reset();
    static size_t GetMaxNumThreads();
private:
    static void RecordMax();
    static tbb::atomic<size_t> cur_count;
    static tbb::atomic<size_t> max_count;
};

#endif

tbb_helpers.cxx

#include "tbb_helpers.hxx"

tbb::atomic<size_t> ConcurrencyProfiler::cur_count;
tbb::atomic<size_t> ConcurrencyProfiler::max_count;

ConcurrencyProfiler::ConcurrencyProfiler()
{
    ++cur_count;
    RecordMax();
}

ConcurrencyProfiler::~ConcurrencyProfiler()
{
    --cur_count;
}

void ConcurrencyProfiler::Reset()
{
    cur_count = max_count = 0;
}

size_t ConcurrencyProfiler::GetMaxNumThreads()
{
    return static_cast<size_t>(max_count);
}

// Performs: max_count = max(max_count,cur_count)
// http://www.threadingbuildingblocks.org/
//    docs/help/tbb_userguide/Design_Patterns/Compare_and_Swap_Loop.htm
void ConcurrencyProfiler::RecordMax()
{
    size_t o;
    do {
        o = max_count;
        if (o >= cur_count) break;
    } while(max_count.compare_and_swap(cur_count,o) != o);
}
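For readers without the legacy `tbb::atomic` type, the same compare-and-swap "record the maximum" pattern can be sketched with `std::atomic`. This is an illustrative port, not the code used for the measurements: `atomic_store_max` and `PeakCounter` are hypothetical names, and the constructor uses the value returned by the increment instead of re-reading the counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Atomically performs: target = max(target, value).
inline void atomic_store_max(std::atomic<std::size_t>& target, std::size_t value) {
    std::size_t seen = target.load();
    // On failure compare_exchange_weak refreshes `seen`, so the loop re-tests
    // against the latest value and stops once it is already >= value.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

// Hypothetical std::atomic port of the helper: counts live instances and
// remembers the peak number that existed at the same time.
class PeakCounter {
public:
    PeakCounter() { atomic_store_max(peak_, ++current_); }
    ~PeakCounter() {
        assert(current_.load() > 0);
        --current_;
    }
    PeakCounter(const PeakCounter&) = delete;
    PeakCounter& operator=(const PeakCounter&) = delete;
    static void reset() { current_ = 0; peak_ = 0; }
    static std::size_t peak() { return peak_.load(); }
private:
    static std::atomic<std::size_t> current_;
    static std::atomic<std::size_t> peak_;
};

std::atomic<std::size_t> PeakCounter::current_(0);
std::atomic<std::size_t> PeakCounter::peak_(0);
```

Creating one `PeakCounter` at the top of each task body, as `ConcurrencyProfiler` does, makes `PeakCounter::peak()` report the largest number of tasks that were ever running at once.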

First I compile the code as a native executable (I am using Intel C++ Composer XE 2013 SP1, with VS2012 Update 4):

C:\> vcvarsall.bat amd64
C:\> iclvars.bat intel64 vs2012
C:\> icl /MD test_tbb.cpp tbb_helpers.cxx tbb.lib

I run the program in the system shell (Windows 8.1). It goes up to 100% CPU utilization and I get the following output:

C:\> test_tbb.exe 2> nul
numTasks = 100
pool_init(-1) -> 8 worker threads          // task_scheduler_init::automatic
pool_init(1) -> 1 worker threads
pool_init(2) -> 2 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 8 worker threads
pool_init(16) -> 16 worker threads
pool_init(32) -> 32 worker threads
pool_init(64) -> 64 worker threads
pool_init(128) -> 98 worker threads
pool_init(256) -> 100 worker threads
pool_init(512) -> 98 worker threads

As expected, the thread pool is initialized as large as we asked and is fully utilized, limited only by the number of tasks we created (in the last case we have 512 threads for only 100 parallel tasks!).

Next I compile the code as a MEX-file:

>> mex -I"C:\Program Files (x86)\Intel\Composer XE\tbb\include" ...
   -largeArrayDims test_tbb.cpp tbb_helpers.cxx ...
   -L"C:\Program Files (x86)\Intel\Composer XE\tbb\lib\intel64\vc11" tbb.lib

Here is the output I get when I run the MEX-function in MATLAB:

>> test_tbb()
numTasks = 100
pool_init(-1) -> 4 worker threads
pool_init(1) -> 4 worker threads
pool_init(2) -> 4 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 4 worker threads
pool_init(16) -> 4 worker threads
pool_init(32) -> 4 worker threads
pool_init(64) -> 4 worker threads
pool_init(128) -> 4 worker threads
pool_init(256) -> 4 worker threads
pool_init(512) -> 4 worker threads

As you can see, no matter what we specify as the pool size, the scheduler always spins up at most 4 threads to execute the parallel tasks (4 being the number of physical cores on my quad-core machine). This confirms what I stated at the beginning of the post.

Note that I explicitly set the processor affinity mask to use all 8 cores, but since there are only 4 running threads, CPU usage stayed approximately at 50% in this case.

Hope this helps answer the question, and sorry for the long post :)

This certainly suggests that the thread pool is initialized and limited before the MEX function gets a chance to initialize its own, and jibes with what the OP and I observed. Is the answer that an existing task scheduler cannot be modified once initialized? There are certainly no [methods of task_scheduler_init](http://www.threadingbuildingblocks.org/docs/doxygen/a00152.html) that allow such changes... — chappjc, Jun 18 '14 at 23:56
that is my conclusion as well; I don't think we can change the thread pool size once the task scheduler has been initialized and not terminated (which is done by MATLAB internally) — Amro, Jun 19 '14 at 00:08
@Amro Thank you Amro for the demonstration. It supports my guess. Today I started implementing OMP within MEX and got the expected 8 threads and 100% CPU usage. BUT OMP takes much longer to finish the same task, and most of its time goes into libiomp5md.dll. I may post another question later to discuss this... — yfeng, Jun 19 '14 at 23:07

score 1 · Answer 2 · edited May 23 '17 at 11:57

Assuming you have more than 4 physical cores on your machine, the affinity mask for the MATLAB standalone process is probably limiting the available CPUs. Functions called from an actual MATLAB installation should have the use of all CPUs, but this may not be the case for standalone MATLAB applications generated with the MATLAB Compiler. Try the test again, running the MEX function directly from MATLAB. In any case, you should be able to reset the affinity mask to make all cores available to TBB, but I do not think this approach will let you coerce TBB to start more threads than you have physical cores.

Background

Since TBB 3.0 update 4, processor affinity settings are referenced to determine the number of available cores, according to a developer blog:

So the only thing that TBB should do instead of asking the system how many CPUs it has, is to retrieve the current process affinity mask, count the number of non-zero bits in it, and voilà, TBB uses no more worker threads than necessary! And this is exactly what TBB 3.0 Update 4 does. Clarifying the statement in the end of my previous blog TBB’s methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() return not simply the total number of logical CPUs in the system or the current processor group, but rather the number of CPUs available to the process in accordance with its affinity settings.

Similarly, the docs for tbb::default_num_threads indicate this change:

Before TBB 3.0 U4 this method returned the number of logical CPU in the system. Currently on Windows, Linux and FreeBSD it returns the number of logical CPUs available to the current process in accordance with its affinity mask.

The docs for tbb::task_scheduler_init::initialize also suggest that the number of threads is "limited by the processor affinity mask".
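The bit counting described in the blog quote can be sketched as follows. This is only an illustration of the idea, not TBB's implementation; `count_cpus_in_mask` and `mask_allows_all_cpus` are hypothetical helper names, and the mask is assumed to fit in 64 bits (a single processor group on Windows).

```cpp
#include <cassert>
#include <cstdint>

// Number of logical CPUs enabled in an affinity mask (one bit per CPU).
int count_cpus_in_mask(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        mask &= mask - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// True when each of the first `logical_cpus` CPUs is enabled in the mask.
bool mask_allows_all_cpus(std::uint64_t mask, int logical_cpus) {
    assert(logical_cpus >= 0 && logical_cpus <= 64);
    for (int cpu = 0; cpu < logical_cpus; ++cpu) {
        if (((mask >> cpu) & 1u) == 0) {
            return false;
        }
    }
    return true;
}
```

On the 8-logical-core machines discussed here, a mask of `0xFF` counts as 8 available CPUs, while `0x0F` would cap the default thread count at 4.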

Resolution

To check whether you are being limited by the affinity mask on Windows, you can call .NET functions from MATLAB:

numCoresInSystem = 16;
proc = System.Diagnostics.Process.GetCurrentProcess();
dec2bin(proc.ProcessorAffinity.ToInt32,numCoresInSystem)

The output string should have no zeros in any position representing a real (present in the system) core.

You can set the affinity mask in MATLAB or C, as described in the Q&A, Set processor affinity for MATLAB engine (Windows 7). The MATLAB way:

proc = System.Diagnostics.Process.GetCurrentProcess();
proc.ProcessorAffinity = System.IntPtr(int32(2^numCoresInSystem-1));
proc.Refresh()

Or using the Windows API, in a mexFunction, before calling task_scheduler_init:

SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR(1) << N) - 1);  // valid for N < 64

For *nix, you can call taskset:

system(sprintf('taskset -p 0x%x %d',2^N - 1,feature('getpid')))  % taskset reads the mask as hexadecimal
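One detail worth care when building these masks in C or C++: an expression like `(1 << N) - 1` is evaluated in `int`, so it is only safe for small N. A minimal sketch of a 64-bit-safe builder is shown below; `affinity_mask_for` is a hypothetical helper name, and on Windows the result would be converted to `DWORD_PTR` before being passed to `SetProcessAffinityMask`.

```cpp
#include <cassert>
#include <cstdint>

// Builds a mask with the lowest n bits set (logical CPUs 0..n-1 enabled).
// Valid for 0 <= n <= 64; a plain `1 << n` on a 32-bit int would overflow
// well before that.
std::uint64_t affinity_mask_for(unsigned n) {
    assert(n <= 64);
    if (n >= 64) {
        return ~std::uint64_t(0);  // shifting a 64-bit value by 64 is undefined
    }
    return (std::uint64_t(1) << n) - 1;
}
```

For the 8-logical-core laptops in this thread, `affinity_mask_for(8)` gives `0xFF`, the same value used in the test program above.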

Thank you, chappjc, it's very useful to know about this recent change in TBB. Unfortunately I tried your solution both the MATLAB way and the Windows API way, and it didn't work. I checked the affinity mask on my computer and it is not limited (the laptop has 8 logical cores and 4 hardware cores, and the output is 8 ones). I still set the affinity mask as you suggested, with numCoresInSystem=8 in my case, but no matter how many threads I try to initialize, TBB in MEX always initializes 4 threads and MEXtest.exe uses ~50% CPU. I also tried running the MEX function directly in MATLAB, and its performance is about the same. — yfeng, Jun 18 '14 at 19:19
@yfeng That's too bad. I was able to reduce parallelism via the affinity mask, so I figured you were facing similar issues. However, I too maxed out at 50%, but that is all of the physical cores (the others are hyperthreads). TBB seems smart enough not to use more than the number of physical cores regardless of the processor affinity mask. I'm assuming you actually have more than 4 computational cores... right? — chappjc, Jun 18 '14 at 22:50
@yfeng Or should TBB allow you to use more threads than cores? — chappjc, Jun 18 '14 at 23:03
I have 4 hardware cores (8 logical cores) on my laptop. Normally TBB initializes 8 threads on my laptop when initialized by default. I'm also able to initialize TBB with any number of threads I want, like 15, in pure C++. Whether 8 or 15, the CPU reaches 100% usage. However, that's not the case with the same code in a MEX file. TBB in MEX always creates 4 threads on my laptop, no matter what number of threads I specify.... — yfeng, Jun 18 '14 at 23:14
I'm wondering if MEX uses TBB and does initialization before I do, so my TBB initialization gets ignored. Just a guess. — yfeng, Jun 18 '14 at 23:17
Getting MEX to use 100% of the CPU is my goal. TBB initialization is an issue I just came across while looking into it. — yfeng, Jun 18 '14 at 23:19
@yfeng I see the difference you have between MEX and non-MEX. It sounds like TBB may initialize before you, as you said. Both tbbmalloc.dll and tbb.dll are under the MATLAB root. But why do you want to use all 8 cores? You generally get very little benefit out of hyperthreading. — chappjc, Jun 18 '14 at 23:25
Neither MATLAB nor the MCR limits the number of cores being used (unless you messed with the affinity yourself!); the affinity mask should default to using all processors. My explanation (same as what you guessed) is that MATLAB initializes TBB for its own use, and changing the thread pool size in your code will be ignored... Please see my answer — Amro, Jun 18 '14 at 23:38
