4

Motivation

I created a header file which wraps Matlab's mex functionality in c++11 classes; especially for MxNxC images. Two functions I created are forEach, which iterates over each pixel in the image, and also a forKernel, which given a kernel and pixel in the image, iterates over the kernel around that pixel, handling all kinds of nifty, boiler-plate indexing mathematics.

The idea is that one could program sliding-windows like this:

image.forEach([](Image &image, size_t row, size_t col) {
  //kr and lc specify which pixel is the center of the kernel
  image.forKernel<double>(row, col, kernel, kr, kc, [](Image &image, double w, size_t row, size_t col) {
    // w is the weight/coefficient of the kernel, row/col are the corresponding coordinates in the image.
    // process ...
  });
});

Problem

This provides a nice way to

  • increase readability: the two function calls are a lot clearer than the corresponding 4 for-loops to do the same,
  • stay flexible: lambda functions allow you to scope all kinds of variables by value or reference, which are invisible to the implementer of forEach / forKernel, and
  • increase execution time, unfortunately: this executes around 8x slower than using just for loops.

The latter point is the problem, of course. I was hoping g++ would be able to optimize the lambda-functions out and inline all the code. This does not happen. Hence I created a minimal working example on 1D data:

#include <iostream>
#include <functional>

struct Data {
  size_t d_size;
  double *d_data;
  Data(size_t size) : d_size(size), d_data(new double[size]) {}
  ~Data() { delete[] d_data; }
  double &operator[](size_t i) { return d_data[i]; }


  inline void forEach(std::function<void(Data &, size_t)> f) {
    for (size_t index = 0; index != d_size; ++index)
      f(*this, index);
  }
};



int main() {
  Data im(50000000);
  im.forEach([](Data &im, size_t i) {
    im[i] = static_cast<double>(i);
  });

  double sum = 0;
  im.forEach([&sum](Data &im, size_t i) {
    sum += im[i];
  });

  std::cout << sum << '\n';
}

source: http://ideone.com/hviTwx

I'm guessing the compiler is not able to compile the code for forEach per lambda-function, as the lambda function is not a template variable. The good thing is that one can compile once and link to it more often with different lambda functions, but the bad thing is that it is slow.

Moreover, the situation discussed in the motivation already contains templates for the data type (double, int, ...), hence the 'good thing' is overruled anyway.

A fast way to implement the previous would be like this:

#include <iostream>
#include <functional>

struct Data {
  size_t d_size;
  double *d_data;
  Data(size_t size) : d_size(size), d_data(new double[size]) {}
  ~Data() { delete[] d_data; }
  double &operator[](size_t i) { return d_data[i]; }
};



int main() {
  size_t len = 50000000;
  Data im(len);
  for (size_t index = 0; index != len; ++index)
    im[index] = static_cast<double>(index);

  double sum = 0;
  for (size_t index = 0; index != len; ++index)
    sum += im[index];

  std::cout << sum << '\n';
}

source: http://ideone.com/UajMMz

It is about 8x faster, but also less readable, especially when we consider more complicated structures like images with kernels.

Question

Is there a way to provide the lambda function as a template argument, such that forEach is compiled for each call, and optimized for each specific instance of the lambda function? Can the lambda function be inlined somehow, since lambda functions are typically not recursive this should be trivial, but what is the syntax?

I found some related posts:

But they do not give a solution in the form of a minimal working example, and they do not discuss the possibility of inlining a lambda function. The answer to my question should do that: change the Data.forEach member function and it's call such that is as fast as possible / allows for as many running time optimizations (not optimizations at run time, but at compile time that decrease runtime) as possible.

Regarding the suggestion of forEveR

Thank you for creating that fix, it's a huge improvement yet still approximately 2x as slow:

Results:

herbert@machine ~ $ g++ -std=c++11 -Wall test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall test2.cc -o test2
herbert@machine ~ $ time ./test0
1.25e+15

real    0m2.563s
user    0m2.541s
sys     0m0.024s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.346s
user    0m0.320s
sys     0m0.026s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.601s
user    0m0.575s
sys     0m0.026s
herbert@machine ~ $ 

I re-ran the code with -O2, which fixes the problem. runtimes of test1 and test2 ar now very similar. Thank you @stijn and @forEveR.

herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test0.cc -o test0
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test1.cc -o test1
herbert@machine ~ $ g++ -std=c++11 -Wall -O2 test2.cc -o test2

herbert@machine ~ $ time ./test0
1.25e+15

real    0m0.256s
user    0m0.229s
sys 0m0.028s
herbert@machine ~ $ time ./test1
1.25e+15

real    0m0.111s
user    0m0.078s
sys 0m0.033s
herbert@machine ~ $ time ./test2
1.25e+15

real    0m0.108s
user    0m0.076s
sys 0m0.032s
herbert@machine ~ $ 
Community
  • 1
  • 1
Herbert
  • 5,279
  • 5
  • 44
  • 69
  • Dont have a clue but +1, this should be an example of a good question – Ander Biguri Nov 21 '14 at 10:14
  • 3
    I can't understand, what is problem to use template parameter, instead of `std::function`? http://ideone.com/8kR3Mw – ForEveR Nov 21 '14 at 10:20
  • Does that inline the lambda function, as if the for-loop and the code in the lambda-function are one code block, of is a function called each loop-iteration? – Herbert Nov 21 '14 at 10:24
  • regarding your two minimal samples: the first one uses std::function, the second one doesn't. It doesn't seem correct to compare those. Or is your question 'why doesn't gcc optimize away std::function'? – stijn Nov 21 '14 at 10:28
  • My question would be, how can I have the best of both worlds? How can I create a function that takes a piece of code as input. You can represent the piece of code as a (lambda) function, but it's really far less than a function, since it does not do recursion. Hence the idea of local variable is just 'syntax sugar' (imho) and should not decrease execution speed if one allows for inlining the lambda function. – Herbert Nov 21 '14 at 10:33
  • Use ForEveR's solution, it should do what you want - if not, you need a better compiler. – stijn Nov 21 '14 at 10:34
  • Better than g++ (Ubuntu 4.9.1-3ubuntu2~14.04.1) 4.9.1 ? I mean, what are you suggesting? Which compiler should I use? – Herbert Nov 21 '14 at 10:35
  • @Herbert how about compile with optimizations? `g++ -O2 -std=c++11 -Wall test0.cc -o test0` – ForEveR Nov 21 '14 at 10:37
  • I am suggesting you try the code and verify that the sum loop in your example gets squashed to 5 lines of assembly code with everything inlined. That's the best one could get out of it (as it's the same as handwriting the loop). If that does not work, you need to check compiler flags. If that still fails, you need a better compiler cause it's rather trivial for the optimizing stage to figure this out. – stijn Nov 21 '14 at 10:41
  • Both of you are right, I will append it to my OP. @ForEveR, feel free to move my comments to your answer if you wish. – Herbert Nov 21 '14 at 10:42

1 Answers1

6

Problem is, that you use std::function, that actually use type-erasure and virtual calls. You can simply use template parameter, instead of std::function. Call of lambda function will be inlined, due n3376 5.1.2/5

The closure type for a lambda-expression has a public inline function call operator (13.5.4) whose param- eters and return type are described by the lambda-expression’s parameter-declaration-clause and trailing- return-type respectively

So, just simply write

  template<typename Function>
  inline void forEach(Function f) {
    for (size_t index = 0; index != d_size; ++index)
      f(*this, index);
  }

Live example

ForEveR
  • 55,233
  • 2
  • 119
  • 133
  • This seems like a reasonable solution, the overhead from 'raw code' is acceptable and it is good to read that lambda function's are inlined. However your suggestion is slower than raw code: see the section regarding you in the OP. Would you know why that is? – Herbert Nov 21 '14 at 10:39
  • (I've ran the crude runtime tests several times to check is the results are consistent, and not due to some side effect of my OS, and I have 4 cores on a machine which is doing basically nothing :) ) – Herbert Nov 21 '14 at 10:40
  • @Herbert compile with optimizations. On my machine results are practically identical. – ForEveR Nov 21 '14 at 10:41
  • Verified and agreed, see end of OP. TY. – Herbert Nov 21 '14 at 10:44