1

I have an O(N^4) scaling algorithm of the form

...
...
...
for (unsigned i = 0; i < nI; ++i) {
  for (unsigned j = 0; j < nJ; ++j) {
    for (unsigned k = 0; k < nK; ++k) {
      for (unsigned l = 0; l < nL; ++l) {
        *calculate value*
        *do something with value*
      }   
    }
  }
}

I need this code in a couple of places so I put it a looper function as part of a class. This loop function is templated so that it can accept a lambda function which takes care of *do something with value*.

Some tests have shown that this is not optimal performance-wise but I do not have any idea on how to get around explicitly writing out this code every time I need it. Do you see a way of doing this?

EigenGrau
  • 57
  • 5

1 Answers1

6

Using a templated function to call the lambda should generate a code that can be optimized by modern optimizing compilers. It is actually the case for the last version of GCC, Clang and MSVC. You can check that on GodBolt with this code:

extern int unknown1();
extern int unknown2(int);

template <typename LambdaType>
int compute(LambdaType lambda, int nI, int nJ, int nK, int nL)
{
    int sum = 0;

    for (unsigned i = 0; i < nI; ++i) {
        for (unsigned j = 0; j < nJ; ++j) {
            for (unsigned k = 0; k < nK; ++k) {
                for (unsigned l = 0; l < nL; ++l) {
                    sum += lambda(i, j, k, l);
                }   
            }
        }
    }

    return sum;
}


int caller(int nI, int nJ, int nK, int nL)
{
    int context = unknown1();

    auto lambda = [&](int i, int j, int k, int l) -> int {
        return unknown2(context + i + j + k + l);
    };

    return compute(lambda, nI, nJ, nK, nL);
}

Using optimization flags, GCC, Clang and MSVC are capable of generating an efficient implementation of compute eliding the lambda calls in the 4 nested loops (unknown2 is directly called in the generated assembly). This is the case even if compute is not inlined. Note the fact that the lambda capture its context do not actually prevent optimisations (although this is much harder for the compiler to optimize this case).

Note that this is important not to use the direct lambda type and not wrappers like std::function as wrapper will likely prevent optimizations (or at least make optimizations much more difficult to apply) resulting in direct function calls. Indeed, the type help the compiler to inline the function and then apply further optimizations like vectorization and constant propagation.

Note that the code of the lambda should be kept small. Otherwise, it may not be inlined resulting in a function call. A direct function call is not so slow with if the function body is pretty big on modern processors because of good branch prediction units and relatively fast large caches. However, the cost of preventing further optimizations mostly possible due to the lambda inlining can be huge. One way to mitigate this cost is to move at least one loop in the lambda (see Data-oriented design for more information). Another solution is to use OpenMP to help the compiler vectorizing the lambda thanks to #pragma omp declare simd [...] directives (assuming your compiler supports it). You can also play with compiler inlining command-line parameters to tell your compiler to actually inline the lambda in such a case.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • 1
    Related: [Why would a C++ compiler fail to inline a lambda passed to a function template?](https://stackoverflow.com/q/58686073) shows an example of making a lambda so big that GCC decides not to inline if you use it twice. – Peter Cordes Oct 24 '22 at 14:05