Using a templated function to call the lambda should generate a code that can be optimized by modern optimizing compilers. It is actually the case for the last version of GCC, Clang and MSVC. You can check that on GodBolt with this code:
extern int unknown1();
extern int unknown2(int);
template <typename LambdaType>
int compute(LambdaType lambda, int nI, int nJ, int nK, int nL)
{
int sum = 0;
for (unsigned i = 0; i < nI; ++i) {
for (unsigned j = 0; j < nJ; ++j) {
for (unsigned k = 0; k < nK; ++k) {
for (unsigned l = 0; l < nL; ++l) {
sum += lambda(i, j, k, l);
}
}
}
}
return sum;
}
int caller(int nI, int nJ, int nK, int nL)
{
int context = unknown1();
auto lambda = [&](int i, int j, int k, int l) -> int {
return unknown2(context + i + j + k + l);
};
return compute(lambda, nI, nJ, nK, nL);
}
Using optimization flags, GCC, Clang and MSVC are capable of generating an efficient implementation of compute
eliding the lambda calls in the 4 nested loops (unknown2
is directly called in the generated assembly). This is the case even if compute
is not inlined. Note the fact that the lambda capture its context do not actually prevent optimisations (although this is much harder for the compiler to optimize this case).
Note that this is important not to use the direct lambda type and not wrappers like std::function
as wrapper will likely prevent optimizations (or at least make optimizations much more difficult to apply) resulting in direct function calls. Indeed, the type help the compiler to inline the function and then apply further optimizations like vectorization and constant propagation.
Note that the code of the lambda should be kept small. Otherwise, it may not be inlined resulting in a function call. A direct function call is not so slow with if the function body is pretty big on modern processors because of good branch prediction units and relatively fast large caches. However, the cost of preventing further optimizations mostly possible due to the lambda inlining can be huge. One way to mitigate this cost is to move at least one loop in the lambda (see Data-oriented design for more information). Another solution is to use OpenMP to help the compiler vectorizing the lambda thanks to #pragma omp declare simd [...]
directives (assuming your compiler supports it). You can also play with compiler inlining command-line parameters to tell your compiler to actually inline the lambda in such a case.