Here's a loop that I've tried with std::vector<double> and with plain old double*.

For 10 million elements, the vector version consistently runs in about 80% of the time that the double* version takes; for pretty much any value of N, vector is notably faster.

Peeking at the GCC STL source code, I don't see that std::vector is doing anything essentially fancier than what the double* idiom is doing (i.e., allocate with plain old new[], operator[] dereferences an offset). This question speaks to that, too.

Any ideas why the vector version is faster?

Compiler: GCC 4.6.1
Example compile line: g++ -Ofast -march=native -DNDEBUG \
                      -ftree-vectorizer-verbose=2 -o vector.bin \
                      vector.cpp -lrt
OS: CentOS 5
CPU: Opteron 8431
RAM: 128 GB

Results are qualitatively the same if I use icpc 11.1 or run on a Xeon. Also, the vectorizer dump says that only the fill operation in std::vector's constructor was vectorized.

The vector version:

#include <vector>
#include <iostream>
#include <boost/lexical_cast.hpp>
#include "util.h"
#include "rkck_params.h"

using namespace std;

int main( int argc, char* argv[] )
{
    const size_t N = boost::lexical_cast<size_t>( argv[ 1 ] );

    vector<double> y_old( N );
    vector<double> y_new( N );
    vector<double> y_err( N );
    vector<double> k0( N );
    vector<double> k1( N );
    vector<double> k2( N );
    vector<double> k3( N );
    vector<double> k4( N );
    vector<double> k5( N );

    const double h = 0.5;

    const timespec start = clock_tick();
    for ( size_t i = 0 ; i < N ; ++i )
    {
        y_new[ i ] =   y_old[ i ]
                     + h
                      *(
                           rkck::c[ 0 ]*k0[ i ]
                         + rkck::c[ 2 ]*k2[ i ]
                         + rkck::c[ 3 ]*k3[ i ]
                         + rkck::c[ 5 ]*k5[ i ]
                       );
        y_err[ i ] =  h
                     *(
                          rkck::cdiff[ 0 ]*k0[ i ]
                        + rkck::cdiff[ 2 ]*k2[ i ]
                        + rkck::cdiff[ 3 ]*k3[ i ]
                        + rkck::cdiff[ 4 ]*k4[ i ]
                        + rkck::cdiff[ 5 ]*k5[ i ]
                      );
    }
    const timespec stop = clock_tick();
    const double total_time = seconds( start, stop );

    // Output
    cout << "vector\t" << N << "\t" << total_time << endl;

    return 0;
}

The double* version:

#include <iostream>
#include <boost/lexical_cast.hpp>
#include "util.h"
#include "rkck_params.h"

using namespace std;

int main( int argc, char* argv[] )
{
    const size_t N = boost::lexical_cast<size_t>( argv[ 1 ] );

    double* y_old = new double[ N ];
    double* y_new = new double[ N ];
    double* y_err = new double[ N ];
    double* k0 = new double[ N ];
    double* k1 = new double[ N ];
    double* k2 = new double[ N ];
    double* k3 = new double[ N ];
    double* k4 = new double[ N ];
    double* k5 = new double[ N ];

    const double h = 0.5;

    const timespec start = clock_tick();
    for ( size_t i = 0 ; i < N ; ++i )
    {
        y_new[ i ]
            =   y_old[ i ]
              + h
               *(
                    rkck::c[ 0 ]*k0[ i ]
                  + rkck::c[ 2 ]*k2[ i ]
                  + rkck::c[ 3 ]*k3[ i ]
                  + rkck::c[ 5 ]*k5[ i ]
                );
        y_err[ i ]
            =  h
              *(
                   rkck::cdiff[ 0 ]*k0[ i ]
                 + rkck::cdiff[ 2 ]*k2[ i ]
                 + rkck::cdiff[ 3 ]*k3[ i ]
                 + rkck::cdiff[ 4 ]*k4[ i ]
                 + rkck::cdiff[ 5 ]*k5[ i ]
               );
    }
    const timespec stop = clock_tick();
    const double total_time = seconds( start, stop );

    delete [] y_old;
    delete [] y_new;
    delete [] y_err;
    delete [] k0;
    delete [] k1;
    delete [] k2;
    delete [] k3;
    delete [] k4;
    delete [] k5;

    // Output
    cout << "plain\t" << N << "\t" << total_time << endl;

    return 0;
}

rkck_params.h:

#ifndef RKCK_PARAMS_H
#define RKCK_PARAMS_H

namespace rkck
{

// Cf. $c_i$ in Ch. 16.2 of NR in C++, 2nd ed.
const double c[ 6 ]
    = { 37.0/378.0,
        0.0,
        250.0/621.0,
        125.0/594.0,
        0.0,
        512.0/1771.0 };

// Cf. $( c_i - c_i^* )$ in Ch. 16.2 of NR in C++, 2nd ed.
const double cdiff[ 6 ]
    = { c[ 0 ] - 2825.0/27648.0,
        c[ 1 ] - 0.0,
        c[ 2 ] - 18575.0/48384.0,
        c[ 3 ] - 13525.0/55296.0,
        c[ 4 ] - 277.0/14336.0,
        c[ 5 ] - 1.0/4.0 };

}

#endif

util.h:

#ifndef UTIL_H
#define UTIL_H

#include <time.h>
#include <utility>

inline timespec clock_tick()
{
    timespec tick;
    clock_gettime( CLOCK_REALTIME, &tick );
    return tick;
}

// \cite{www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime}
inline double seconds( const timespec& earlier, const timespec& later )
{
    double seconds_diff = -1.0;
    double nano_diff = -1.0;

    if ( later.tv_nsec < earlier.tv_nsec )
    {
        seconds_diff = later.tv_sec - earlier.tv_sec - 1;
        nano_diff = ( 1.0e9 + later.tv_nsec - earlier.tv_nsec )*1.0e-9;
    }
    else
    {
        seconds_diff = later.tv_sec - earlier.tv_sec;
        nano_diff = ( later.tv_nsec - earlier.tv_nsec )*1.0e-9;
    }

    return seconds_diff + nano_diff;
}

#endif
Hector

1 Answer


In the vector version your data is initialized to zero. In the `new[]` version it's uninitialized, so different work might be done.

Did you run multiple times, in different orders?
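
A minimal sketch of the difference being described (illustrative, not code from the answer): `std::vector<double>( N )` zero-fills its elements, a bare `new double[ N ]` leaves them indeterminate, and `new double[ N ]()` value-initializes them, which is the change Cubbi mentions in the comments below.

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t N = 10000000;

    std::vector<double> v( N );              // every element is 0.0
    double* raw = new double[ N ];           // elements are indeterminate
    double* raw_zeroed = new double[ N ](); // value-initialized: every element is 0.0

    delete [] raw;
    delete [] raw_zeroed;
    return 0;
}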

Mark B
  • Sounds right: I reproduced the OP's results on my system (vector was 50% faster), and changing all `new[]`s into `new[]()`s dropped the difference to 1 - 5%. – Cubbi Jul 14 '11 at 19:46
  • Is the multiply instruction timing data-dependent? I thought that as it got faster over the years, it got more consistent as well. – Mark Ransom Jul 14 '11 at 19:46
  • 1
    @Mark Ransom It's not impossible that multiplying by zero could be a lot faster than multiplying by random numbers. – Mark B Jul 14 '11 at 19:51
  • Initializing the arrays and vectors brings double*'s time down just under vector's, for me. I had forgotten that `new[]` on built-in types doesn't do any initialization. I would hypothesize that the `vector(size_t)` constructor running over all the vector elements brings them closer into memory (I don't know what I mean by that; maybe many are still in the cache?) than the initially untraversed double* arrays. – Hector Jul 14 '11 at 20:00
  • @Hector the `vector(size_t, T = T())` constructor already initializes the values to 0 automatically, so I believe that's the difference, not anything to do with memory locality. – Mark B Jul 14 '11 at 20:12
  • @Mark B: In my subsequent test, I actually initialized both the vector and double* arrays with identical nonzero data (`y_old[i]=i; y_new[i]=i+1;` etc.) and saw vector's speed remain the same, while double* got faster. (Initialization is outside the timer calls, of course.) Trying to compare apples to apples as much as possible. – Hector Jul 14 '11 at 20:22
  • 1
    @Hector, you never said what value you're using for `N` but if it's small enough to fit in cache, "priming" the cache via initialization would definitely make it faster. Makes much more sense than my speculation about the multiply instruction. – Mark Ransom Jul 14 '11 at 20:30
  • @Mark Ransom: By `N` I mean the number of elements in each array; my originally cited comparison was for `N` = 10 million, but the difference was apparent for quite a wide range. Ten million probably isn't what you mean by small...maybe I'm on the wrong track with the memory thing. – Hector Jul 14 '11 at 20:37
  • @Mark Ransom: His second sentence says "for 10 million elements". Not likely to be in cache. However, the kernel is not even going to allocate the page until you touch it for the first time, so maybe touching them all at once is somehow more friendly to the TLB (?). Interesting test case. – Nemo Jul 14 '11 at 20:40
  • 1
    @Nemo, I think you're on to something - if the memory is allocated but not physically mapped, mapping it is going to take some time. I'm starting to regret all the times I've given the advice "measure it" when measuring appears to be so hard! – Mark Ransom Jul 14 '11 at 21:22
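
Following up on the first-touch point in the last two comments, here is a hedged sketch (the `prefault` helper is hypothetical, not part of the question's code) of touching every page of a raw array before `clock_tick()`, so that page mapping happens outside the timed loop, much as the vector constructor's zero-fill already touches every element.

#include <cstddef>

// Hypothetical helper; 4096 bytes is an assumed page size.
inline void prefault( double* p, std::size_t n )
{
    const std::size_t step = 4096 / sizeof( double );
    for ( std::size_t i = 0 ; i < n ; i += step )
        p[ i ] = 0.0;          // first write maps the page
    if ( n > 0 )
        p[ n - 1 ] = 0.0;      // cover the final partial page
}

int main()
{
    const std::size_t N = 10000000;
    double* k0 = new double[ N ];
    prefault( k0, N );         // would go before the clock_tick() call
    delete [] k0;
    return 0;
}

Calling this on each `new double[ N ]` array before starting the timer would put the raw-pointer version on the same footing as the vectors in that respect.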