In this link, std::function vs template, there is a nice discussion about the overhead of std::function. Basically, to avoid the 10x overhead caused by heap allocation of the functor you pass to the std::function constructor, you must use std::ref or std::cref.
Example taken from @CassioNeri's answer, showing how to pass lambdas to std::function by reference:
float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

// a, b, c are floats in the enclosing scope; std::cref (from <functional>)
// wraps the lambda so std::function stores a reference instead of a copy
foo(std::cref([a, b, c](float arg){ return arg * 0.5f; }));
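For completeness, here is a self-contained sketch of the same idea (the values of a, b, c, the extra arithmetic and the printed output are just illustrative, not from the linked answer): wrapping the lambda in std::cref makes std::function store a small std::reference_wrapper instead of a (potentially heap-allocated) copy of the closure.

#include <functional>
#include <iostream>

float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

int main() {
    float a = 1.0f, b = 2.0f, c = 3.0f;
    auto lam = [a, b, c](float arg) { return (a + b + c) * arg * 0.5f; };

    std::cout << foo(lam)            << "\n";  // std::function copies the closure
    std::cout << foo(std::cref(lam)) << "\n";  // std::function stores only a reference_wrapper
    return 0;
}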
Now, the Intel Threading Building Blocks (TBB) library lets you evaluate loops in parallel using lambdas/functors, as shown in the example below.
Example code:
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <vector>
int main() {
tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());
std::vector<double> a(1000);
std::vector<double> c(1000);
std::vector<double> b(1000);
std::fill(b.begin(), b.end(), 1);
std::fill(c.begin(), c.end(), 1);
auto f = [&](const tbb::blocked_range<size_t>& r) {
for(size_t j=r.begin(); j!=r.end(); ++j) a[j] = b[j] + c[j];
};
tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000), f);
return 0;
}
So my question is: does Intel TBB's parallel_for have the same kind of overhead (heap allocation of the functor) that we see with std::function? Should I pass my functors/lambdas to parallel_for by reference, using std::cref, to speed up the code?
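To make the question concrete, this is the kind of change I have in mind: a sketch of the example above where the loop body is handed to parallel_for through std::cref. std::reference_wrapper is copyable and callable, so it should satisfy TBB's documented Body requirements; whether it actually buys anything is exactly what I am asking.

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <functional>   // std::cref
#include <vector>

int main() {
    tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());

    std::vector<double> a(1000), b(1000, 1.0), c(1000, 1.0);

    auto body = [&](const tbb::blocked_range<size_t>& r) {
        for (size_t j = r.begin(); j != r.end(); ++j) a[j] = b[j] + c[j];
    };

    // TBB requires the Body to be copyable and may copy it; here any copy
    // would be of the small reference_wrapper, not of the lambda it refers to.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000), std::cref(body));
    return 0;
}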