In this link, std::function vs template, there is a nice discussion about the overhead of std::function. Basically, to avoid the 10x overhead caused by heap allocation of the functor you pass to the std::function constructor, you must use std::ref or std::cref.
Example taken from @CassioNeri's answer, showing how to pass lambdas to std::function by reference:
float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

// a, b, c are floats in the enclosing scope; std::cref (from <functional>)
// wraps the lambda so std::function stores a reference instead of a copy
foo(std::cref([a, b, c](float arg){ return arg * 0.5f; }));
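For completeness, here is a self-contained sketch of the same idea (the values of a, b, c, the extra arithmetic and the printed output are just illustrative, not from the linked answer): wrapping the lambda in std::cref makes std::function store a small std::reference_wrapper instead of a (potentially heap-allocated) copy of the closure.

#include <functional>
#include <iostream>

float foo(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

int main() {
    float a = 1.0f, b = 2.0f, c = 3.0f;
    auto lam = [a, b, c](float arg) { return (a + b + c) * arg * 0.5f; };

    std::cout << foo(lam)            << "\n";  // std::function copies the closure
    std::cout << foo(std::cref(lam)) << "\n";  // std::function stores only a reference_wrapper
    return 0;
}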
Now, the Intel Threading Building Blocks (TBB) library lets you evaluate loops in parallel using lambdas/functors, as shown in the example below.
Example code:
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <vector>
int main() {
tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());
std::vector<double> a(1000);
std::vector<double> c(1000);
std::vector<double> b(1000);
std::fill(b.begin(), b.end(), 1);
std::fill(c.begin(), c.end(), 1);
auto f = [&](const tbb::blocked_range<size_t>& r) {
for(size_t j=r.begin(); j!=r.end(); ++j) a[j] = b[j] + c[j];
};
tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000), f);
return 0;
}
So my question is: does Intel TBB's parallel_for have the same kind of overhead (heap allocation of the functor) that we see with std::function? Should I pass my functors/lambdas to parallel_for by reference, using std::cref, to speed up the code?
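To make the question concrete, this is the kind of change I have in mind: a sketch of the example above where the loop body is handed to parallel_for through std::cref. std::reference_wrapper is copyable and callable, so it should satisfy TBB's documented Body requirements; whether it actually buys anything is exactly what I am asking.

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tbb_thread.h"
#include <functional>   // std::cref
#include <vector>

int main() {
    tbb::task_scheduler_init init(tbb::tbb_thread::hardware_concurrency());

    std::vector<double> a(1000), b(1000, 1.0), c(1000, 1.0);

    auto body = [&](const tbb::blocked_range<size_t>& r) {
        for (size_t j = r.begin(); j != r.end(); ++j) a[j] = b[j] + c[j];
    };

    // TBB requires the Body to be copyable and may copy it; here any copy
    // would be of the small reference_wrapper, not of the lambda it refers to.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000), std::cref(body));
    return 0;
}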