19

I'm compiling a C++ library which defines a single function that randomly samples from a set of data points. The data points are stored in a std::vector. There are 126,272 std::vector push_back statements, where the vector in question is of type double. It is taking a long time to compile.

Why would this take so long? (All the code other than the std::vector push_back statements would take less than 1 second to compile, because there's very little other code.)

synaptik
  • 12
    Most compilers simply aren't optimized for 100,000+ line files. – Jerry Coffin Dec 16 '12 at 05:54
  • It took several minutes on my quad core machine with 8 GB memory. Fortunately, it did successfully compile in the end. – synaptik Dec 16 '12 at 05:56
  • 3
    Why complain? http://xkcd.com/303/ – Richard J. Ross III Dec 16 '12 at 05:59
  • 14
    Instead of 126,272 individual `push_back` statements, use, u know, a *loop*. – Cheers and hth. - Alf Dec 16 '12 at 05:59
  • What compiler? Also, (shot in the dark), try putting, say, every 5000 in a different function, then call each function in turn; this may help the compiler. – Cameron Dec 16 '12 at 06:00
  • 2
    If that data is an array of constants, why not using the array directly? – didierc Dec 16 '12 at 06:00
  • 1
    My guess is that since this is one big basic block, the compiler is trying to optimize the basic block as a whole. The optimization time for a basic block is typically not linear with the size of the block. – Vaughn Cato Dec 16 '12 at 06:09
  • @Cheersandhth.-Alf Because I have tons of data points and I wish to insert them into a vector, but I don't want the data points to be accessible in a text file, and thus I'm not reading the data from text files. – synaptik Dec 16 '12 at 06:14
  • @didierc What do you mean? – synaptik Dec 16 '12 at 06:15
  • @Cameron OK, I may try that. Thanks. – synaptik Dec 16 '12 at 06:15
  • 4
    oh. data only in source code, not in file. well try this for c++03: `int const data[] = { ... }; vector v( data, data + sizeof(data)/sizeof(*data) );`, where the dots indicate where the numbers should be specified. the loop here is in the `vector` constructor. – Cheers and hth. - Alf Dec 16 '12 at 06:19
  • What is your compiler and which optimization level is used? Is unoptimized build (`-O0`) faster? – osgx Dec 16 '12 at 10:27
  • @osgx I was using GCC, and it was a debug build (-O0). – synaptik Dec 16 '12 at 16:00
  • 2
    @synaptik _why_ do you not want those to appear in a file? You can encrypt it if you don't want others to be able to read it - anythings better than keeping data points in source really. – Cubic Dec 25 '12 at 22:44
  • This looks like a classic XY problem. Nothing good will come of solving the question you have asked. C++ compilers are not designed to embed data - don't write code that has 100,000+ lines of push_back in it. Read the data from a data source and put your push_back in a loop. – quant Dec 03 '16 at 02:15

3 Answers

45

There is the -ftime-report option in gcc which prints a detailed report of the time spent in each compiler phase.

I used Ubuntu 12.04 64-bit with gcc 4.6.3 and this code to reproduce your situation:

#include <vector>
using namespace std;

int main()
{
  vector<double> d;

 d.push_back(5.7862517058766);
/* ... N lines generated with 
  perl -e 'print(" d.push_back(",rand(10),");\n") for 1..100000'
*/
 d.push_back(3.77195464257674);

  return d.size();
}

Here are the -ftime-report outputs for various N (wall time was inaccurate because of background load on the PC, so look at the user time, usr):

N=10000

$ g++ -ftime-report ./pb10k.cpp

Execution times (seconds)
...
 expand vars           :   1.48 (47%) usr   0.01 ( 7%) sys   1.49 (44%) wall    1542 kB ( 2%) ggc
 expand                :   0.11 ( 3%) usr   0.01 ( 7%) sys   0.10 ( 3%) wall   19187 kB (30%) ggc
...
 TOTAL                 :   3.18             0.15             3.35              64458 kB

N=100000

$ g++ -ftime-report ./pb100k.cpp

Execution times (seconds)
....
 preprocessing         :   0.49 ( 0%) usr   0.28 ( 5%) sys   0.59 ( 0%) wall    6409 kB ( 1%) ggc
 parser                :   0.96 ( 0%) usr   0.39 ( 6%) sys   1.41 ( 0%) wall  108217 kB (18%) ggc
 name lookup           :   0.06 ( 0%) usr   0.07 ( 1%) sys   0.24 ( 0%) wall    1023 kB ( 0%) ggc
 inline heuristics     :   0.13 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall       0 kB ( 0%) ggc
 integration           :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    4095 kB ( 1%) ggc
 tree gimplify         :   0.22 ( 0%) usr   0.00 ( 0%) sys   0.23 ( 0%) wall   36068 kB ( 6%) ggc
 tree eh               :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.14 ( 0%) wall    5678 kB ( 1%) ggc
 tree CFG construction :   0.08 ( 0%) usr   0.01 ( 0%) sys   0.10 ( 0%) wall   38544 kB ( 7%) ggc
....
 expand vars           : 715.98 (97%) usr   1.62 (27%) sys 718.32 (83%) wall   18359 kB ( 3%) ggc
 expand                :   1.04 ( 0%) usr   0.09 ( 1%) sys   1.64 ( 0%) wall  190836 kB (33%) ggc
 post expand cleanups  :   0.09 ( 0%) usr   0.01 ( 0%) sys   0.15 ( 0%) wall      43 kB ( 0%) ggc
....
 rest of compilation   :   1.94 ( 0%) usr   2.56 (43%) sys 102.42 (12%) wall   63620 kB (11%) ggc
 TOTAL                 : 739.68             6.01           866.46             586293 kB

So, there is some extra work for huge N in the "expand vars" phase. This phase is exactly at this line: cfgexpand.c:4463 (the region bracketed by the TV_VAR_EXPAND timer macro).

An interesting fact: I get very short compilation times with my custom-compiled 32-bit g++ 4.6.2 (~20 seconds for N = 100000).

What is the difference between my g++ and the Ubuntu g++? Ubuntu turns on GCC stack protection (the -fstack-protector option) by default, and this protection is added exactly in the "expand vars" phase (found in the sources at cfgexpand.c:1644, in expand_used_vars(); mentioned here):

N=100000, stack protector disabled with the -fno-stack-protector option (use it for your code):

$ g++ -ftime-report -fno-stack-protector pb100k.cpp 2>&1 |egrep 'TOTAL|expand vars'
 expand vars           :   0.08 ( 0%) usr   0.01 ( 1%) sys   0.09 ( 0%) wall   18359 kB ( 3%) ggc
 TOTAL                 :  23.05             1.48            24.60             586293 kB

Compilation time is 24 seconds, down from 800.

UPDATE:

After running gcc inside callgrind (the call-graph profiling tool from Valgrind), I can say that there are N stack variables. If the stack protector is enabled, they are handled in the "expand vars" phase by three O(N^2) algorithms: roughly N^2 successful conflict detections and 1.5 * N^2 bit manipulations, plus some nested-loop bookkeeping.

Why is the number of stack variables so high? Because every double constant in your code is saved to a different slot on the stack. Then it is loaded from its slot and passed as the calling convention dictates (via the top of the stack on x86; via registers on x86_64). The funny fact: the code generated for the push_backs is the same with -fstack-protector and with -fno-stack-protector, and the stack layout of the constants is the same too. Only some stack layout offsets of the non-push_back code are affected (checked by comparing two runs with -S and diff -u). No additional code was created by the enabled stack protector.
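As a rough illustration (hand-written, not the compiler's actual expansion): at -O0 each push_back(literal) call binds the literal to push_back's reference parameter, so each call materializes its own temporary double in a separate stack slot of the enclosing function.

#include <vector>

// Hand-written sketch, not real compiler output: at -O0 every double
// literal passed to push_back becomes its own temporary, and every
// temporary occupies a distinct stack slot.
void what_the_compiler_effectively_sees(std::vector<double>& d)
{
    const double tmp_1 = 5.7862517058766;    // stack slot #1
    d.push_back(tmp_1);
    const double tmp_2 = 3.77195464257674;   // stack slot #2
    d.push_back(tmp_2);
    // ... N such temporaries in total -> N stack variables for the
    //     "expand vars" phase to consider.
}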

Enabling the stack protector fatally changes some behaviour inside the compiler. I can't say exactly where (note: it is possible to find this turning point by comparing stack traces with callgraph.tar.gz by Juan M. Bello Rivas).

The first big N*(N+1)/2 = O(N^2) walk is in the expand_used_vars_for_block (tree block, level) function, which records conflicts between pairs of stack variables:

  /* Since we do not track exact variable lifetimes (which is not even
     possible for variables whose address escapes), we mirror the block
     tree in the interference graph.  Here we cause all variables at this
     level, and all sublevels, to conflict.  */
  if (old_sv_num < this_sv_num)
    {
      new_sv_num = stack_vars_num;

      for (i = old_sv_num; i < new_sv_num; ++i)
        for (j = i < this_sv_num ? i : this_sv_num; j-- > old_sv_num ;)
          add_stack_var_conflict (i, j);
    }
}

A call to add_stack_var_conflict(i,j) amounts to (a minimal sketch follows the list):

  • allocating (once per variable) a conflict bitmap with a dynamic size that eventually grows to N bits
  • setting a bit in the i-th variable's bitmask to mark the conflict with j
  • setting a bit in the j-th variable's bitmask to mark the conflict with i
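For intuition only, here is a minimal C++ sketch of that bookkeeping (GCC's real implementation is in cfgexpand.c and uses its own sparse bitmap type, not std::vector<bool>): marking every pair of N variables as conflicting already costs on the order of N^2 bit operations.

#include <cstddef>
#include <vector>

// Minimal sketch of the conflict bookkeeping described above; not GCC's code.
struct ConflictTable {
    std::vector<std::vector<bool> > bit;       // one conflict bitmap per variable
    explicit ConflictTable(std::size_t n) : bit(n, std::vector<bool>(n, false)) {}
    void add_stack_var_conflict(std::size_t i, std::size_t j) {
        bit[i][j] = true;                      // i conflicts with j
        bit[j][i] = true;                      // j conflicts with i
    }
};

// "All variables at this level, and all sublevels, conflict": ~N*(N-1)/2
// pairs are touched, i.e. O(N^2) work for N stack variables.
void mark_all_pairs(ConflictTable& t, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            t.add_stack_var_conflict(i, j);
}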

There is a second N^2 walk in add_alias_set_conflicts. It does a type check for every pair with objects_must_conflict_p, checking whether the two variables are of the same type (most pairs are; this is type-based alias analysis, TBAA). If not, add_stack_var_conflict is called; there are only about N such calls out of this N^2 loop nest.

The last huge walk is in the partition_stack_vars() function, with a qsort of the stack vars ( O(N log N) ) and an N*(N-1)/2 = O(N^2) walk to find all non-conflicting pairs. Here is the pseudocode of partition_stack_vars from the cfgexpand.c file:

        Sort the objects by size.
        For each object A {
          S = size(A)
          O = 0
          loop {
            Look for the largest non-conflicting object B with size <= S.
                   /* There is a call to stack_var_conflict_p to check for 
                    * conflict between 2 vars */
            UNION (A, B)
            offset(B) = O
            O += size(B)
            S -= size(B)
          }
        }

The stack_var_conflict_p function just checks whether the i-th variable has a conflict bitmask and whether its j-th bit is set as a conflict flag with the j-th variable (via a call to bitmap_bit_p(i->conflict_mask, j)). The really bad news here is that callgrind says every conflict check was successful, so the UNION logic is skipped for every pair.
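To see why this degenerates here, below is a hedged C++ simplification of the partitioning loop (the real partition_stack_vars also sorts by size and tracks remaining slot space): when stack_var_conflict_p reports a conflict for every pair, the inner search never finds a partner and the whole pass reduces to roughly N^2/2 bitmap lookups.

#include <cstddef>
#include <vector>

// Simplified sketch, not GCC's code: with every pair conflicting, no
// UNION ever happens and only the pairwise conflict checks remain.
struct StackVar {
    std::size_t size;
    std::vector<bool> conflict;                // conflict bitmap, N bits
};

bool stack_var_conflict_p(const std::vector<StackVar>& v,
                          std::size_t i, std::size_t j) {
    return v[i].conflict[j];                   // the bitmap_bit_p check
}

void partition_stack_vars(std::vector<StackVar>& vars)
{
    for (std::size_t a = 0; a < vars.size(); ++a) {
        for (std::size_t b = a + 1; b < vars.size(); ++b) {
            if (stack_var_conflict_p(vars, a, b))
                continue;                      // every check succeeds here
            // UNION(a, b): let b share a's slot -- never reached for
            // this test case, according to callgrind.
        }
    }
}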

So, a lot of time is wasted on O(N^2) bit sets and O(N^2/2) bit checks, and all this work doesn't help to optimize anything. And yes, all of this is part of -O0 and is triggered by -fstack-protector.

UPDATE2:

It seems the turning point is expand_one_var in cfgexpand.c from 4.6, in the check for immediate or deferred allocation of a variable on the stack:

1110      else if (defer_stack_allocation (var, toplevel))
1111        add_stack_var (origvar);
1112      else
1113        {
1114          if (really_expand)
1115            expand_one_stack_var (origvar);
1116          return tree_low_cst (DECL_SIZE_UNIT (var), 1);
1117        }

(according to callgrind, expand_one_stack_var was called here only in the fast variant)

Deferred allocation is forced when -fstack-protector is enabled (sometimes all stack variables need to be reordered). There is even a comment about a "quadratic problem", which looks all too familiar to us now:

969 /* A subroutine of expand_one_var.  VAR is a variable that will be
970    allocated to the local stack frame.  Return true if we wish to
971    add VAR to STACK_VARS so that it will be coalesced with other
972    variables.  Return false to allocate VAR immediately.
973 
974    This function is used to reduce the number of variables considered
975    for coalescing, which reduces the size of the quadratic problem.  */
976 
977 static bool
978 defer_stack_allocation (tree var, bool toplevel)
979 {
980   /* If stack protection is enabled, *all* stack variables must be deferred,
981      so that we can re-order the strings to the top of the frame.  */
982   if (flag_stack_protect)
983     return true;

(stack allocation is also deferred at -O2 and higher)

Here is the commit which added this logic: http://gcc.gnu.org/ml/gcc-patches/2005-05/txt00029.txt

osgx
  • @Nemo: I fail to see how it's a performance bug if insane code takes an insane amount of time to compile. If reasonable code takes unreasonable time, that's a different story, but 100k calls to `push_back` really deserve being slow. I'd rather have the GCC developers focus on something useful than on making code that is ill-formed compile better. – Damon Mar 01 '13 at 19:20
  • 4
    @Damon: First, "ill-formed" is a technical term in C++ -- defined by the standard -- and you are using it incorrectly. This program is perfectly well-formed. Second, you are not the arbiter of what is "insane"; not everyone's needs are identical to yours. If simple linear code induces O(n^2) behavior in the compiler, then I call that a performance bug in the compiler. I highly doubt I am alone. – Nemo Mar 02 '13 at 00:26
  • Ah, I knew that you'd object to "ill-formed". But I insist, a program that has 100k `push_back`s for initializing a `vector` with constant data is ill-formed, even if it is _technically_ correct and what the specification allows. The compiler is certainly required to correctly process such code, but I think it is very allowable and acceptable if bad input results in bad compiler performance. Similarly, if you use an embarrassingly bad (but correct) algorithm, it is acceptable if the resulting binary performs badly -- you cannot blame the compiler for that (well, in my opinion, at least). – Damon Mar 02 '13 at 16:30
5

This question was thoroughly answered by osgx's great answer above.

Perhaps one additional aspect: push_back() vs. an initializer list.

Running the above test with 100000 push_back calls, I get the following result with gcc 4.4.6 on a Debian 6.0.6 system:

$ time g++ -std=c++0x -ftime-report ./pb100k.cc 

Execution times (seconds)
 garbage collection    :   0.55 ( 1%) usr   0.00 ( 0%) sys   0.55 ( 1%) wall       0 kB ( 0%) ggc
 ...
 reload                :  33.95 (58%) usr   0.13 ( 6%) sys  34.14 (56%) wall   65723 kB ( 9%) ggc
 thread pro- & epilogue:   0.66 ( 1%) usr   0.00 ( 0%) sys   0.66 ( 1%) wall      84 kB ( 0%) ggc
 final                 :   1.82 ( 3%) usr   0.01 ( 0%) sys   1.81 ( 3%) wall      21 kB ( 0%) ggc
 TOTAL                 :  58.65             2.13            60.92             737584 kB

real    1m2.804s
user    1m0.348s
sys     0m2.328s

When using an initializer list instead, it is much faster:

$ cat pbi100k.cc
#include <vector>
using namespace std;

int main()
{
   vector<double> d {
   0.190987822870774,
   /* 100000 lines with doubles generated with:
          perl -e 'print(rand(10),",\n") for 1..100000'
   */
   7.45608614801021};

  return d.size();
}

$ time g++ -std=c++0x -ftime-report ./pbi100k.cc 

Execution times (seconds)
 callgraph construction:   0.02 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall      25 kB ( 0%) ggc
 preprocessing         :   0.72 (59%) usr   0.06 (25%) sys   0.80 (54%) wall    8004 kB (12%) ggc
 parser                :   0.24 (20%) usr   0.12 (50%) sys   0.36 (24%) wall   43185 kB (65%) ggc
 name lookup           :   0.01 ( 1%) usr   0.05 (21%) sys   0.03 ( 2%) wall    1447 kB ( 2%) ggc
 tree gimplify         :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall     277 kB ( 0%) ggc
 tree find ref. vars   :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      15 kB ( 0%) ggc
 varconst              :   0.19 (15%) usr   0.01 ( 4%) sys   0.20 (14%) wall   11288 kB (17%) ggc
 integrated RA         :   0.02 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 1%) wall      74 kB ( 0%) ggc
 reload                :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      61 kB ( 0%) ggc
 TOTAL                 :   1.23             0.24             1.48              66378 kB

real    0m1.701s
user    0m1.416s
sys     0m0.276s

This is about 30+ times faster!
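For compilers without C++11 support, the C++03 variant from Alf's comment under the question should behave similarly: keep the constants in one static array and let the vector's range constructor do the loop. A quick sketch:

#include <vector>
using namespace std;

// C++03 sketch: the constants live in a single static array initializer
// and the copy loop runs inside the vector range constructor.
static const double data[] = {
    0.190987822870774,
    /* ... the remaining constants ... */
    7.45608614801021
};

int main()
{
    vector<double> d(data, data + sizeof(data) / sizeof(data[0]));
    return d.size();
}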

Andreas Florath
0

I believe the long compile time relates to vector being a template. The compiler needs to rewrite every occurrence of push_back into a call to the corresponding instantiated function. It is like having many overloaded functions, where the compiler needs to do name mangling to resolve the correct function. This is extra work compared with simply compiling non-overloaded function calls.

Israel Unterman
  • 2
    Which compiler phases would need to do so much work to handle the template? The `parser` and `name lookup`? But in my `-ftime-report` results, both phases take < 2 seconds of wall time. – osgx Dec 25 '12 at 22:45