C++11 tuple performance

Question

I am about to make my code more generic by using std::tuple in many places, including the single-element case: for example, tuple<double> instead of double. Before doing that, I decided to check the performance of this particular case.

Here is a simple benchmark:

#include <tuple>
#include <iostream>

using std::cout;
using std::endl;
using std::get;
using std::tuple;

int main(void)
{

#ifdef TUPLE
    using double_t = std::tuple<double>;
#else
    using double_t = double;
#endif

    constexpr int count = 1e9;
    auto array = new double_t[count];

    long long sum = 0;
    for (int idx = 0; idx < count; ++idx) {
#ifdef TUPLE
        sum += get<0>(array[idx]);
#else
        sum += array[idx];
#endif
    }
    delete[] array;
    cout << sum << endl; // just "external" side effect for variable sum.
}

And run results:

$ g++ -DTUPLE -O2 -std=c++11 test.cpp && time ./a.out
0  

real    0m3.347s
user    0m2.839s
sys     0m0.485s

$ g++  -O2 -std=c++11 test.cpp && time ./a.out
0  

real    0m2.963s
user    0m2.424s
sys     0m0.519s

I thought that tuple is a purely compile-time template and that all get<> calls compile down to ordinary variable access in this case. The memory allocation sizes in this test are also the same. Why is there a difference in execution time?

EDIT: The problem was the initialization of the tuple<> objects. To make the test fair, one line must be changed:

     constexpr int count = 1e9;
-    auto array = new double_t[count];
+    auto array = new double_t[count]();

     long long sum = 0;

After that one can observe similar results:

$ g++ -DTUPLE -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real    0m3.342s
real    0m3.339s
real    0m3.343s

$ g++ -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real    0m3.349s
real    0m3.339s
real    0m3.334s

Have you run it many times to make sure you get consistent results? – Mike Makuch, Oct 02 '13 at 20:15
@koodawg Yes, I have. Time with tuple is stable at 3.3s, w/o -- 2.9s. – Alexander Sergeyev, Oct 02 '13 at 20:19
@aaronman I raised `count` to 10^9 just to make the difference in time more noticeable. I think you can use 10^8, for example. – Alexander Sergeyev, Oct 02 '13 at 20:22
@balki Not yet. I'm not so cool in assembler stuff. So far I can provide disassembled code. – Alexander Sergeyev, Oct 02 '13 at 20:25
@aaronman, this code seems very memory intensive - if you don't have more than 8GB it can be very slow. – zch, Oct 02 '13 at 20:25
@aaronman All code and time measurements in my post are in sync. – Alexander Sergeyev, Oct 02 '13 at 20:30

score 14 · Accepted Answer · edited Oct 02 '13 at 20:39

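The default constructor of std::tuple value-initializes its elements, so every double inside a tuple starts at 0. A plain double allocated with `new double[count]` is only default-initialized, which leaves its value indeterminate and costs nothing.

In the generated assembly, the following initialization loop is present only in the tuple version. Otherwise the two versions are equivalent.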

.L2:
    movq    $0, (%rdx)
    addq    $8, %rdx
    cmpq    %rcx, %rdx
    jne .L2
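
The difference between the two forms of `new[]` can be checked directly. What follows is a minimal sketch rather than the original benchmark; the helper names `tuples_are_zeroed`, `value_initialized_doubles_are_zeroed` and `check_initialization` are made up for illustration. It deliberately never reads from a plain `new double[n]` array, because those values are indeterminate.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical helper: new[] without () still runs the default constructor
// of each std::tuple<double>, and that constructor value-initializes the
// element, so every double is 0.0. This is the hidden zeroing loop.
bool tuples_are_zeroed(std::size_t n) {
    std::tuple<double>* a = new std::tuple<double>[n];
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        ok = ok && (std::get<0>(a[i]) == 0.0);
    }
    delete[] a;
    return ok;
}

// Hypothetical helper: the trailing () value-initializes the doubles, which
// makes the plain-double version pay for the same zeroing pass. Without the
// () the doubles are indeterminate and must not be read.
bool value_initialized_doubles_are_zeroed(std::size_t n) {
    double* a = new double[n]();
    bool ok = true;
    for (std::size_t i = 0; i < n; ++i) {
        ok = ok && (a[i] == 0.0);
    }
    delete[] a;
    return ok;
}

// Illustrative check, not part of the original post.
void check_initialization() {
    assert(tuples_are_zeroed(1000));
    assert(value_initialized_doubles_are_zeroed(1000));
}
```

Calling `check_initialization()` from `main` exercises both cases; the array size of 1000 is arbitrary and far smaller than the 10^9 elements used in the question.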

answered Oct 02 '13 at 20:26 by aaronman · edited Oct 02 '13 at 20:39 by zch

Confirmed in assembly. It does two loops, one zeroing and one summing. `double` based one does only one. – zch Oct 02 '13 at 20:29
@zch if you want you can edit my answer, not really in an assembly reading mood – aaronman Oct 02 '13 at 20:30
Excellent observation. The OP should write `new double_t[count]();` for a fair comparison. – Kerrek SB Oct 02 '13 at 20:33
@KerrekSB thanks, I agree he should write a new test, I love proving that c++ is just as fast as c and that you should trust your compiler – aaronman Oct 02 '13 at 20:39
@KerrekSB Confirmed, time matches after this patch. – Alexander Sergeyev Oct 02 '13 at 20:39
Should I put some details in question post? – Alexander Sergeyev Oct 02 '13 at 20:40
@AlexanderSergeev I would put the info about the new test in – aaronman Oct 02 '13 at 20:42
Of course this is proving tuple is just as fast after slowing down the non-tuple version. In order to speed up the tuple version by eliminating the initialization from it you can use `template <typename T> struct no_initialization { T t; no_initialization() {} operator T& () { return t; } };` and `using double_t = std::tuple<no_initialization<double>>;` – bames53 Oct 02 '13 at 20:43
@bames53: Maybe, but that's really beside the point. If you use `vector<tuple<double>>`, all is well, since you only initialize *when you have an object*. It's the pointless creation of objects that you don't need that is the real performance hit. (Another reason why you shouldn't use dynamic, raw arrays.) – Kerrek SB Oct 02 '13 at 20:46
@KerrekSB That's not a panacea to the issue of unnecessary value initialization. Sometimes you need to create a buffer to be filled by another routine, for example. `std::uninitialized_fill` was added for a reason. – bames53 Oct 02 '13 at 20:52
@bames53: Well, maybe, though use of `uninitialized_fill` should be exceptionally rare, and probably limited to *implementations* of `vector` and such like. I wager that the vast majority of every-day container problems can be solved efficiently with a vector with `reserve` and `push_back` etc. `uninitialized_fill` still separates memory from objects, only it allows you to use some kind of typed-pointer like iteration. – Kerrek SB Oct 02 '13 at 20:55
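
The two alternatives discussed in this comment thread can be sketched roughly as follows. The wrapper follows bames53's comment, with the template parameter list written as `template <typename T>`, which is an assumption about the original spelling. The alias `raw_double_t` and the functions `sum_after_fill` and `build_with_push_back` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Wrapper from the comment thread: the user-provided empty constructor
// leaves t uninitialized even when the enclosing tuple value-initializes
// its element, so no zeroing work is required.
template <typename T>
struct no_initialization {
    T t;
    no_initialization() {}
    operator T&() { return t; }
};

using raw_double_t = std::tuple<no_initialization<double>>;

// Hypothetical example: allocate without zeroing, write every element,
// then read it back. Reading before writing would be undefined behavior.
double sum_after_fill(std::size_t n) {
    raw_double_t* a = new raw_double_t[n];
    for (std::size_t i = 0; i < n; ++i) {
        std::get<0>(a[i]).t = 1.0;
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double value = std::get<0>(a[i]);  // goes through operator T&
        sum += value;
    }
    delete[] a;
    return sum;
}

// The vector-based alternative: reserve capacity up front, then construct
// each element only once its value is known.
std::vector<std::tuple<double>> build_with_push_back(std::size_t n) {
    std::vector<std::tuple<double>> v;
    v.reserve(n);
    assert(v.capacity() >= n);
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(std::make_tuple(static_cast<double>(i)));
    }
    return v;
}
```

Neither variant is measured here. The wrapper keeps the raw array but moves the responsibility for initialization to the caller, while the vector version never creates an object before its value exists.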
