40

This was already touched in Why C++ lambda is slower than ordinary function when called multiple times? and C++0x Lambda overhead But I think my example is a bit different from the discussion in the former and contradicts the result in the latter.

On the search for a bottleneck in my code I found a recusive template function that processes a variadic argument list with a given processor function, like copying the value into a buffer.

template <typename T>
void ProcessArguments(std::function<void(const T &)> process)
{}

template <typename T, typename HEAD, typename ... TAIL>
void ProcessArguments(std::function<void(const T &)> process, const HEAD &head, const TAIL &... tail)
{
  process(head);
  ProcessArguments(process, tail...);
}

I compared the runtime of a program that uses this code together with a lambda function as well as a global function that copies the arguments into a global buffer using a moving pointer:

int buffer[10];
int main(int argc, char **argv)
{
  int *p = buffer;

  for (unsigned long int i = 0; i < 10E6; ++i)
  {
    p = buffer;
    ProcessArguments<int>([&p](const int &v) { *p++ = v; }, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
  }
}

compiled with g++ 4.6 and -O3 measuring with the tool time takes more than 6 seconds on my machine while

int buffer[10];
int *p = buffer;
void CopyIntoBuffer(const int &value)
{
  *p++ = value;
}

int main(int argc, char **argv)
{
  int *p = buffer;

  for (unsigned long int i = 0; i < 10E6; ++i)
  {
    p = buffer;
    ProcessArguments<int>(CopyIntoBuffer, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
  }

  return 0;
}

takes about 1.4 seconds.

I do not get what is going on behind the scenes that explains the time overhead and am wondering if I can change something to make use of lambda functions without paying with runtime.

Community
  • 1
  • 1
mcbulba
  • 523
  • 1
  • 4
  • 7
  • So is the global one really slower? You say it's 6s vs. 1.4 for the lambda-based one, but then the last sentence does not make sense. – Sergey Kalinichenko Sep 04 '13 at 16:32
  • 1
    When doing your analysis, did you peek at a generated assembly listing? – WhozCraig Sep 04 '13 at 16:33
  • Would passing `process` by `const` reference, like this, `void ProcessArguments(const std::function &process)`, make any difference? – Sergey Kalinichenko Sep 04 '13 at 16:34
  • Uhm, according to your benchmark, using a lambda is almost five times faster than using a function (pointer). Your last sentence doesn’t make sense. (That said, I don’t really believe that benchmark anyway, it’s completely unreliable – too many repetitions, too little work inside the loop) – Konrad Rudolph Sep 04 '13 at 17:30
  • Wow, if I calculate correctly, that's roughly 300 to 1200 cycles per call. That's even more expensive than a `new`, and I already thought that was bad! Apparently there's a _lot_ of crap going on behind the scenes. Let's hope the compiler developers can fix this, otherwise lambdas are dead for performance programming... – cmaster - reinstate monica Sep 04 '13 at 18:01
  • 3
    @cmaster Well, hold your horses here. Apart from the fact that this benchmark *is* flawed – even if that’s true it only counts for closures. Non-closure lambdas have *no* overhead. That’s a fact. And closures also don’t, in usual code – OP is cheating, the non-lambda code is using a global object, that’s not normally an option: generally you’d have to implement the closure manually, and then you’d once again have the exact same overhead as the with a lambda. If anything, this is an argument against using `std::function` (in specific scenarios), not against lambdas. – Konrad Rudolph Sep 04 '13 at 19:12
  • @dasblinkenlight (and also Konrad Rudolph): Yeah, you are right. I switched the measured times. I am sorry for that and edited the question accordingly. Also, the main problem is the use of std::function and copying as pointed out in the answer by biocomp and not lambda. Anyway, std::function with copying behaves worse with lambda than the global function. – mcbulba Sep 05 '13 at 12:42
  • 1
    ultimately a duplicate of [What is the performance overhead of std::function?](https://stackoverflow.com/questions/5057382/what-is-the-performance-overhead-of-stdfunction) – underscore_d Dec 28 '17 at 12:33

1 Answers1

60

The problem here is your usage of std::function. You send it by copy and therefore copying its contents (and doing that recursively as you unwind parameters).

Now, for pointer to function, contents is, well, just pointer to function. For lambda, contents are at least pointer to function + reference that you captured. This is twice as much to copy. Plus, because of std::function's type erasure copying any data will most likely be slower (not inlined).

There are several options here, and the best would probably be passing not std::function, but template instead. The benefits are that your method call is more likely to be inlined, no type erasure happens by std::function, no copying happens, everything is so very good. Like that:

template <typename TFunc>
void ProcessArguments(const TFunc& process)
{}

template <typename TFunc, typename HEAD, typename ... TAIL>
void ProcessArguments(const TFunc& process, const HEAD &head, const TAIL &... tail)
{
  process(head);
  ProcessArguments(process, tail...);
}

Second option is doing the same, but sending the process by copy. Now, copying does happen, but still is neatly inlined.

What's equally important is that process' body can also be inlined, especially for lamda. Depending on complexity of copying the lambda object and its size, passing by copy may or may not be faster than passing by reference. It may be faster because compiler may have harder time reasoning about reference than the local copy.

template <typename TFunc>
void ProcessArguments(TFunc process)
{}

template <typename TFunc, typename HEAD, typename ... TAIL>
void ProcessArguments(TFunc process, const HEAD &head, const TAIL &... tail)
{
  process(head);
  ProcessArguments(process, tail...);
}

Third option is, well, try passing std::function<> by reference. This way you at least avoid copying, but calls will not be inlined.

Here are some perf results (using ideones' C++11 compiler). Note that, as expected, inlined lambda body is giving you best performance:

Original function:
0.483035s

Original lambda:
1.94531s


Function via template copy:
0.094748

### Lambda via template copy:
0.0264867s


Function via template reference:
0.0892594s

### Lambda via template reference:
0.0264201s


Function via std::function reference:
0.0891776s

Lambda via std::function reference:
0.09s
Artem Tokmakov
  • 1,135
  • 10
  • 7
  • 1
    An interesting read! I'm just wondering, why would you suggest lambda via template copy? I mean, what's wrong with reference? are there advantages to adding the copy? – Guy Feb 01 '17 at 06:54
  • 2
    I did not suggest copy over reference :) What I said was that there are all those options. Maybe it was not very clear. And you can't be sure which one (passing lambda by reference or by copy) is better. Depending on contents of the lambda (its size), its copying complexity, passing by reference may or may not be faster. Compilers may have harder time reasoning about references than about objects passed by value. – Artem Tokmakov Feb 13 '17 at 19:14
  • maybe it's because I'm not a native English speaker, but this sentence confused me _you could **actually** do the same_ (emphasis is mine). The word 'actually' led me to think you meant that this is better... thanks for clarifying. – Guy Feb 14 '17 at 11:15
  • 2
    As you may have guessed, I am neither a native English speaker ) I updated the wording for passing by copy, hope this is a little better. – Artem Tokmakov Feb 15 '17 at 00:56