Here is the code for testing.

Tuple test:

#include <tuple>
#include <vector>

using namespace std;

int main(){

    vector<tuple<int,int>> v;


    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}

Pair test:

#include <utility>
#include <vector>

using namespace std;

int main(){

    vector<pair<int,int>> v;


    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}

I did the time measurement via the Linux `time` command. The results are:

|       |   -O0   |    -O2   |
|:------|:-------:|:--------:|
| Pair  |   8.9 s |  1.60 s  |
| Tuple |  19.8 s |  1.96 s  |

I am wondering why there is such a big difference between these two data structures at -O0, as they should be very similar. There is only a small difference at -O2.

Why is the difference at -O0 so big, and why is there any difference at all?

EDIT:

The code with `v.resize()`:

Pair:

#include <utility>
#include <vector>

using namespace std;

int main(){

    vector<pair<int,int>> v;

    v.resize(100000000);

    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_pair(var, var);
    }
}

Tuple:

#include <tuple>
#include <vector>

using namespace std;

int main(){

    vector<tuple<int,int>> v;

    v.resize(100000000);

    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_tuple(var, var);
    }
}

Results:

|       |   -O0   |    -O2   |
|:------|:-------:|:--------:|
| Pair  |  5.01 s |  0.77 s  |
| Tuple |  10.6 s |  0.87 s  |

EDIT:

My system

g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
GLIBCXX_3.4.19
  • Why don't you look at the call stack, and `sizeof`? – Ajay Nov 11 '14 at 11:36
  • I would say you should do `v.reserve(100000000);` before the loops in both cases to make it a more accurate test. – Jonathan Potter Nov 11 '14 at 11:37
  • @JonathanPotter but if that resolves the difference, then it would show construction is slower. – gbjbaanb Nov 11 '14 at 11:42
  • Measuring performance with `-O0` is pointless - just compare optimised code when benchmarking. – Paul R Nov 11 '14 at 11:43
  • For performance measurement you should run and measure your program not just once but several times and then take the mean (or median) of the measured runtimes. Otherwise you can always have that one system call which somehow alters your measurements in a non-determinable way. – Sorin Totuarez Nov 11 '14 at 11:57
  • First make sure you are measuring times correctly: http://stackoverflow.com/a/9006802/412080 – Maxim Egorushkin Nov 11 '14 at 11:57
  • I am wondering if `-O3` would make a difference. – BЈовић Nov 11 '14 at 12:05
  • I was timing with the unix `time` command. I tried -O3 with perf stat, and it turned out that pair and tuple performed the same on my system. – Nov 11 '14 at 12:33
  • @gbjbaanb: `reserve` doesn't do construction, only memory allocation. It would remove the memory allocation component of the test. Your updated code shows `resize`, which is not what I suggested. – Jonathan Potter Nov 11 '14 at 19:07
  • @JonathanPotter ah yes, good point. Not my update BTW. – gbjbaanb Nov 12 '14 at 08:34

2 Answers

You are missing some crucial information: What compiler do you use? What do you use to measure the performance of the microbenchmark? What standard library implementation do you use?

My system:

g++ (GCC) 4.9.1 20140903 (prerelease)
GLIBCXX_3.4.20

Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead. With that, funnily enough, I observe the opposite of what you see:

g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair
Performance counter stats for './pair' (10 runs):

      1647.045151      task-clock:HG (msec)      #    0.993 CPUs utilized            ( +-  1.94% )
              346      context-switches:HG       #    0.210 K/sec                    ( +- 40.13% )
                7      cpu-migrations:HG         #    0.004 K/sec                    ( +- 22.01% )
          182,978      page-faults:HG            #    0.111 M/sec                    ( +-  0.04% )
    3,394,685,602      cycles:HG                 #    2.061 GHz                      ( +-  2.24% ) [44.38%]
    2,478,474,676      stalled-cycles-frontend:HG #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
    1,550,747,174      stalled-cycles-backend:HG #   45.68% backend  cycles idle     ( +-  1.60% ) [44.66%]
    2,837,484,461      instructions:HG           #    0.84  insns per cycle        
                                                  #    0.87  stalled cycles per insn  ( +-  4.86% ) [55.78%]
      526,077,681      branches:HG               #  319.407 M/sec                    ( +-  4.52% ) [55.82%]
          829,623      branch-misses:HG          #    0.16% of all branches          ( +-  4.42% ) [55.74%]
      594,396,822      L1-dcache-loads:HG        #  360.887 M/sec                    ( +-  4.74% ) [55.59%]
        20,842,113      L1-dcache-load-misses:HG  #    3.51% of all L1-dcache hits    ( +-  0.68% ) [55.46%]
        5,474,166      LLC-loads:HG              #    3.324 M/sec                    ( +-  1.81% ) [44.23%]
  <not supported>      LLC-load-misses:HG       

      1.658671368 seconds time elapsed                                          ( +-  1.82% )

versus:

g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple
Performance counter stats for './tuple' (10 runs):

        996.090514      task-clock:HG (msec)      #    0.996 CPUs utilized            ( +-  2.41% )
              102      context-switches:HG       #    0.102 K/sec                    ( +- 64.61% )
                4      cpu-migrations:HG         #    0.004 K/sec                    ( +- 32.24% )
          181,701      page-faults:HG            #    0.182 M/sec                    ( +-  0.06% )
    2,052,505,223      cycles:HG                 #    2.061 GHz                      ( +-  2.22% ) [44.45%]
    1,212,930,513      stalled-cycles-frontend:HG #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
      621,104,447      stalled-cycles-backend:HG #   30.26% backend  cycles idle     ( +-  3.48% ) [44.69%]
    2,700,410,991      instructions:HG           #    1.32  insns per cycle        
                                                  #    0.45  stalled cycles per insn  ( +-  1.66% ) [55.94%]
      486,476,408      branches:HG               #  488.386 M/sec                    ( +-  1.70% ) [55.96%]
          959,651      branch-misses:HG          #    0.20% of all branches          ( +-  4.78% ) [55.82%]
      547,000,119      L1-dcache-loads:HG        #  549.147 M/sec                    ( +-  2.19% ) [55.67%]
        21,540,926      L1-dcache-load-misses:HG  #    3.94% of all L1-dcache hits    ( +-  2.73% ) [55.43%]
        5,751,650      LLC-loads:HG              #    5.774 M/sec                    ( +-  3.60% ) [44.21%]
  <not supported>      LLC-load-misses:HG       

      1.000126894 seconds time elapsed                                          ( +-  2.47% )

As you can see, in my case the reason is the much higher number of stalled cycles for the pair version, both in the frontend and in the backend.

Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11

Indeed, enabling -flto equalizes the results for me:

Performance counter stats for './pair' (10 runs):

      1021.922944      task-clock:HG (msec)      #    0.997 CPUs utilized            ( +-  1.15% )
                63      context-switches:HG       #    0.062 K/sec                    ( +- 77.23% )
                5      cpu-migrations:HG         #    0.005 K/sec                    ( +- 34.21% )
          195,396      page-faults:HG            #    0.191 M/sec                    ( +-  0.00% )
    2,109,877,147      cycles:HG                 #    2.065 GHz                      ( +-  0.92% ) [44.33%]
    1,098,031,078      stalled-cycles-frontend:HG #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
      701,553,535      stalled-cycles-backend:HG #   33.25% backend  cycles idle     ( +-  1.09% ) [44.68%]
    3,288,420,630      instructions:HG           #    1.56  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.88% ) [55.89%]
      672,941,736      branches:HG               #  658.505 M/sec                    ( +-  0.80% ) [56.00%]
          660,278      branch-misses:HG          #    0.10% of all branches          ( +-  2.05% ) [55.93%]
      474,314,267      L1-dcache-loads:HG        #  464.139 M/sec                    ( +-  1.32% ) [55.73%]
        19,481,787      L1-dcache-load-misses:HG  #    4.11% of all L1-dcache hits    ( +-  0.80% ) [55.51%]
        5,155,678      LLC-loads:HG              #    5.045 M/sec                    ( +-  1.69% ) [44.21%]
  <not supported>      LLC-load-misses:HG       

      1.025083895 seconds time elapsed                                          ( +-  1.03% )

and for tuple:

Performance counter stats for './tuple' (10 runs):

      1018.980969      task-clock:HG (msec)      #    0.999 CPUs utilized            ( +-  0.47% )
                8      context-switches:HG       #    0.008 K/sec                    ( +- 29.74% )
                3      cpu-migrations:HG         #    0.003 K/sec                    ( +- 42.64% )
          195,396      page-faults:HG            #    0.192 M/sec                    ( +-  0.00% )
    2,103,574,740      cycles:HG                 #    2.064 GHz                      ( +-  0.30% ) [44.28%]
    1,088,827,212      stalled-cycles-frontend:HG #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
      697,438,071      stalled-cycles-backend:HG #   33.15% backend  cycles idle     ( +-  0.41% ) [44.76%]
    3,305,631,646      instructions:HG           #    1.57  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.21% ) [55.94%]
      675,175,757      branches:HG               #  662.599 M/sec                    ( +-  0.16% ) [56.02%]
          656,205      branch-misses:HG          #    0.10% of all branches          ( +-  0.98% ) [55.93%]
      475,532,976      L1-dcache-loads:HG        #  466.675 M/sec                    ( +-  0.13% ) [55.69%]
        19,430,992      L1-dcache-load-misses:HG  #    4.09% of all L1-dcache hits    ( +-  0.20% ) [55.49%]
        5,161,624      LLC-loads:HG              #    5.065 M/sec                    ( +-  0.47% ) [44.14%]
  <not supported>      LLC-load-misses:HG       

      1.020225388 seconds time elapsed                                          ( +-  0.48% )

So remember: `-flto` is your friend, and failed inlining can have extreme effects on heavily templated code. Use `perf stat` to find out what's happening.

milianw
  • PS: I assume pair is slower as it's probably implemented using tuple and just so happens to hit the inline depth limit. – milianw Nov 11 '14 at 12:03
  • Wrong assumption. `std::pair` is required to have two true data members called `first` and `second`, so it can't be implemented in terms of anything else. – Sebastian Redl Nov 11 '14 at 12:17
  • `pair` was introduced into C++ before `tuple` appeared in C++11, so it can't be implemented by tuple. Moreover, who would use a more complex structure for a simple one? – phuclv Nov 12 '14 at 05:25
  • @LưuVĩnhPhúc `pair` can still be implemented with `tuple`. I suppose the code is constantly rewritten. For example, a C++ compiler *is* written in C++ :) – Slaus Aug 15 '17 at 15:55

milianw didn't address the -O0 vs. -O2 difference, so I'd like to add an explanation for that.

It is fully expected that std::tuple will be slower than std::pair when not optimized, because it is a more complicated object. A pair has exactly two members, so its methods are straightforward to define. But a tuple has an arbitrary number of members, and the only way to iterate over a template argument list is recursion. Therefore most functions for tuple handle one member and then recurse to handle the rest, so for a 2-tuple you have twice as many function calls.

Now, when they are optimized, the compiler will inline that recursion and there should not be a significant difference, which the tests clearly confirm. That applies to heavily templated stuff in general: templates can be written to provide abstraction with little or no runtime overhead, but that relies on optimization to inline all the trivial functions.

Jan Hudec
  • I would have answered along the same line of reasoning (arbitrary number of arguments), but then I saw that std::tuple has an extra constructor for pairs (http://www.cplusplus.com/reference/tuple/tuple/tuple/ (5)), which should prevent the argument list iteration. – Sorin Totuarez Nov 11 '14 at 13:18
  • @SorinTotuarez: That constructor is not being used here. And even it can't avoid the recursion, as the structure of the tuple itself is recursive. – Jan Hudec Nov 11 '14 at 13:32
  • Is this required by the language? I would have thought `std::tuple` might be _specified_ in the standard that way but an implementation could do what it wants. E.g., have specializations for all common (<9 element) tuple sizes. – davidbak Oct 24 '17 at 03:50