16

I wanted to time a few functions' execution and I've written myself a helper:

using namespace std;
template<int N = 1, class Fun, class... Args>
void timeExec(string name, Fun fun, Args... args) {

    auto start = chrono::steady_clock::now();

    for(int i = 0; i < N; ++i) {
        fun(args...);
    }

    auto end = chrono::steady_clock::now();

    auto diff = end - start;
    cout << name << ": "<< chrono::duration<double, milli>(diff).count() << " ms. << endl;
}

I figured that for timing member functions this way I'd have to use bind or lambda and I wanted to see which would impact the performance less, so I did:

const int TIMES = 10000;
timeExec<TIMES>("Bind evaluation", bind(&decltype(result)::eval, &result));
timeExec<1>("Lambda evaluation", [&]() {
    for(int i = 0; i < TIMES; ++i) {
        result.eval();
    }
});

The results are:

Bind evaluation: 0.355158 ms.
Lambda evaluation: 0.014414 ms.

I don't know the internals, but I assume that lambda cannot be that better than bind. The only plausible explanation I can think of is the compiler optimizing-out subsequent function evaluations in the lambda's loop.

How would you explain it?

Adam Kosiorek
  • 1,438
  • 1
  • 13
  • 17
  • 1
    The lambda body can be inlined. The bind expression may not be inlinable. Check the machine code to be sure. Also try `Fun && fun`. – Kerrek SB Jul 20 '14 at 17:41
  • One more thing: I came up with it when learning expression templates. I've got a computation graph that can be evaluated either at run-time or at compile-time. The described behaviour occurs in case of compile-time evaluation, but not the run-time. – Adam Kosiorek Jul 20 '14 at 17:50
  • 1
    You really need to check the assembly to figure out what's going on. – T.C. Jul 20 '14 at 18:16
  • 6
    Is this not simply `timeExec` vs. `timeExec<1>`? – ildjarn Jul 20 '14 at 22:04
  • This is definitely not an apples-to-apples comparison. I'm curious what happens if you do `timeExec("Lambda evaluation", [&] { result.eval(); })` – xaviersjs Jul 23 '18 at 23:28

2 Answers2

22

I assume that lambda cannot be that better than bind.

That's quite a preconception.

Lambdas are tied into the compiler internals, so extra optimization opportunities may be found. Moreover, they're designed to avoid inefficiency.

However, there are probably no compiler optimization tricks happening here. The likely culprit is the argument to bind, bind(&decltype(result)::eval, &result). You are passing a pointer-to-member-function (PTMF) and an object. Unlike the lambda type, the PTMF does not capture what function actually gets called; it only contains the function signature (parameter and return types). The slow loop is using an indirect branch function call, because the compiler failed to resolve the function pointer through constant propagation.

If you rename the member eval() to operator () () and get rid of bind, then the explicit object will essentially behave like the lambda and the performance difference should disappear.

Potatoswatter
  • 134,909
  • 25
  • 265
  • 421
  • 1
    Does this apply to all bind variations, like `std::mem_fn`? Some times it's much shorter to use `std::mem_fn` in stuff like `std::all_of` or `std::any_of`. – The Quantum Physicist Jan 11 '19 at 09:46
13

I've tested it. My results shows, that Lambda is actually faster than bind.

This is the code (please don't look at style):

#include <iostream>
#include <functional>
#include <chrono>

using namespace std;
using namespace chrono;
using namespace placeholders;

typedef void SumDataBlockEventHandler(uint8_t data[], uint16_t len);

class SpeedTest {
    uint32_t sum = 0;
    uint8_t i = 0;
    void SumDataBlock(uint8_t data[], uint16_t len) {
        for (i = 0; i < len; i++) {
            sum += data[i];
        }
    }
public:
    function<SumDataBlockEventHandler> Bind() {
        return bind(&SpeedTest::SumDataBlock, this, _1, _2);
    }
    function<SumDataBlockEventHandler> Lambda() {
        return [this](auto data, auto len)
        {
            SumDataBlock(data, len);
        };
    }
};

int main()
{
    SpeedTest test;
    function<SumDataBlockEventHandler> testF;
    uint8_t data[] = { 0,1,2,3,4,5,6,7 };

#if _DEBUG
    const uint32_t testFcallCount = 1000000;
#else
    const uint32_t testFcallCount = 100000000;
#endif
    uint32_t callsCount, whileCount = 0;
    auto begin = high_resolution_clock::now();
    auto end = begin;

    while (whileCount++ < 10) {
        testF = test.Bind();
        begin = high_resolution_clock::now();
        callsCount = 0;
        while (callsCount++ < testFcallCount)
            testF(data, 8);
        end = high_resolution_clock::now();
        cout << testFcallCount << " calls of binded function: " << duration_cast<nanoseconds>(end - begin).count() << "ns" << endl;

        testF = test.Lambda();
        begin = high_resolution_clock::now();
        callsCount = 0;
        while (callsCount++ < testFcallCount)
            testF(data, 8);
        end = high_resolution_clock::now();
        cout << testFcallCount << " calls of lambda function: " << duration_cast<nanoseconds>(end - begin).count() << "ns" << endl << endl;
    }
    system("pause");
}

Console results (Release with optimalization):

100000000 calls of binded function: 1846298524ns
100000000 calls of lambda function: 1048086461ns

100000000 calls of binded function: 1259759880ns
100000000 calls of lambda function: 1032256243ns

100000000 calls of binded function: 1264817832ns
100000000 calls of lambda function: 1039052353ns

100000000 calls of binded function: 1263404007ns
100000000 calls of lambda function: 1031216018ns

100000000 calls of binded function: 1275305794ns
100000000 calls of lambda function: 1041313446ns

100000000 calls of binded function: 1256565304ns
100000000 calls of lambda function: 1031961675ns

100000000 calls of binded function: 1248132135ns
100000000 calls of lambda function: 1033890224ns

100000000 calls of binded function: 1252277130ns
100000000 calls of lambda function: 1042336736ns

100000000 calls of binded function: 1250320869ns
100000000 calls of lambda function: 1046529458ns

I've compiled it under Visual Studio Enterprise 2015 in the Release mode with Full Optimization (/ Ox) and in the Debug mode with disabled optimalization. Results confirm that lambda is faster than the bind on my laptop (Dell Inspiron 7537, Intel Core i7-4510U 2.00GHz, 8GB RAM).

Can anyone verify this on your computer?

Paweł Iwaneczko
  • 853
  • 1
  • 10
  • 13
  • 3
    OnIntel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz - 32GB Ram - clang++-4.0 libstdc++ lambda is 1/3 of the time used in comparison to bind. Compiled with -O3 bind and lambda are almost equal, but bind is still 5% slower. Compiling without loop unrolling confirms the results. – daminetreg Oct 25 '17 at 13:12
  • 1
    AMD Ryzen 5 3600 16 GB RAM - VS Release x64: Lambda isn't always faster, but most of the time there is as >5% improvement of lambda over bind. – JMRC Feb 21 '22 at 06:12