Intrinsic array access is much faster than std::vector access -- Black Magic?

Question

I have set up a test program to compare array access performance to that of std::vector. I have found several similar questions but none seem to address my specific concern. I was scratching my head for some time over why array access seemed to be 6 times faster than vector access, when I have read in the past that they should be equivalent. As it turns out, this seems to be a function of the Intel compiler (v12) and optimization (occurs with anything above -O1), since I see better performance with std::vector when using gcc v4.1.2, and array has only a 2x advantage with gcc v4.4.4. I am running the tests on a RHEL 5.8 machine with Xeon X5355 cores. As an aside, I have found iterators to be faster than element access.

I am compiling with the following commands:

icpc -fast test.cc
g++44 -O3 test.cc

Can anyone explain the dramatic improvement in speed?

#include <vector>
#include <iostream>

using namespace std;

int main() {
  int sz = 100;
  clock_t start,stop;
  int ncycle=1000;
  float temp  = 1.1;

  // Set up and initialize vector
  vector< vector< vector<float> > > A(sz, vector< vector<float> >(sz,  vector<float>(sz, 1.0)));

  // Set up and initialize array
  float*** a = new float**[sz];
  for( int i=0; i<sz; ++i) {
    a[i] = new float*[sz];
    for( int j=0; j<sz; ++j) {
      a[i][j] = new float[sz]();
      for( int k=0; k<sz; ++k)
        a[i][j][k] = 1.0;
    }
  }

  // Time the array
  start = clock();
  for( int n=0; n<ncycle; ++n )
    for( int i=0; i<sz; ++i )
      for( int j=0; j<sz; ++j )
        for( int k=0; k<sz; ++k )
          a[i][j][k] *= temp;

  stop = clock();
  std::cout << "STD ARRAY: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds"     << std::endl;

  // Time the vector
      start = clock();
  /*
  */
  for( int n=0; n < ncycle; ++n )
    for (vector<vector<vector<float> > >::iterator it1 = A.begin(); it1 != A.end();     ++it1)
      for (vector<vector<float> >::iterator it2 = it1->begin(); it2 != it1->end();     ++it2)
        for (vector<float>::iterator it3 =it2->begin(); it3 != it2->end(); ++it3)
          *it3 *= temp;
  /*
     for( int n=0; n < ncycle; ++n )
       for( int i=0; i < sz; ++i )
         for( int j=0; j < sz; ++j )
           for( int k=0; k < sz; ++k )
             A[i][j][k] *= temp;
  */

  stop = clock();
  std::cout << "VECTOR: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" <<     std::endl;


  for( int i=0; i<100; ++i) {
    for( int j=0; j<100; ++j)
      delete[] a[i][j];
  }
  for( int i=0; i<100; ++i) {
    delete[] a[i];
  }
  delete[] a;
  return 0;
}

SOLVED

After noting Bo's indication that the compiler "knows everything" about the loop and can therefore optimize it more than the vector case, I replaced the multiplications by "temp" with multiplications by a call to "rand()". This leveled the playing field and in fact seems to give std::vector a slight lead. Timing of various scenarios are as follows:

ARRAY (flat): 111.15 seconds
ARRAY (flat): 0.011115 seconds per cycle
ARRAY (3d): 111.73 seconds
ARRAY (3d): 0.011173 seconds per cycle
VECTOR (flat): 110.51 seconds
VECTOR (flat): 0.011051 seconds per cycle
VECTOR (3d): 118.05 seconds
VECTOR (3d): 0.011805 seconds per cycle
VECTOR (flat iterator): 108.55 seconds
VECTOR (flat iterator): 0.010855 seconds per cycle
VECTOR (3d iterator): 111.93 seconds
VECTOR (3d iterator): 0.011193 seconds per cycle

The takeaway seems to be that vectors are just as fast as arrays, and slightly faster when flattened (contiguous memory) and used with iterators. My experiment only averaged over 10,000 iterations, so it could be argued that these are all roughly equivalent and the choice of which to use should be determined by whichever is easiest to use; in my case, that would be the "3d iterator" case.

What are ni/nj/nk initialized with? They aren't in your sample. — Dave S, Aug 23 '12 at 13:38
There should be no difference whatsoever with optimization enabled. — Xeo, Aug 23 '12 at 13:39
A triply nested vector is a terrible thing. Use a 1-D flat vector in strides and marvel at the performance. — Kerrek SB, Aug 23 '12 at 13:40
@Xeo: Yeah, that's true actually. Hmm. Compiler and library debug modes perhaps? — Kerrek SB, Aug 23 '12 at 13:42
@Kerrek, maybe I dont understand what you are implying. I created a 2-d vector and printed out the memory addresses, and they were all contiguous(consecutive). Not saying that is the case here, just pointing it out. EDIT: I made a small 3d test case and you are correct. Is there a way to maintain the conceptual clarity of a 3-d object and also have contiguous memory? — Ethereal, Aug 23 '12 at 13:42
What @Kerrek is saying is that between the inner and outer vectors, the memory isn't contiguous. This wouldn't be the case with a `std::vector v(sz * sz * sz, 1.0);` and manual index management. — Xeo, Aug 23 '12 at 13:44
@ethereal: That may be coincidence (and is actually near impossible). Each nested vector maintains its own, independently allocated memory. — Kerrek SB, Aug 23 '12 at 13:44
@ethereal: Regarding your question: Yes, use a 1-D flat array and access it in strides. — Kerrek SB, Aug 23 '12 at 13:54
Light reading re: Kerrek's advice: http://stackoverflow.com/questions/6465133/stdvector-and-contiguous-memory-of-multidimensional-arrays — Ethereal, Aug 23 '12 at 14:05
On my machine (Ubuntu, Pentium) this program shows `std::vector` is ever so slightly faster than array, with gcc 2.95.4, 4.1.3, 4.3.4, 4.4.3 and 4.7.0. — n. m. could be an AI, Aug 23 '12 at 14:52
@Kerrek, I made a test prog to compare a flat 1d array vs a 3d array, and I am seeing the flat array 2x slower than the 3d array. I cannot include code in comments apparently but a modification of the above code will show this. — Ethereal, Aug 23 '12 at 15:19
@n.m., this may be due to the fact that my processor supports additional instruction sets, SSE3 etc.. This would fit my suspicion that there is optimization at work here that I don't understand. — Ethereal, Aug 23 '12 at 15:22
You can look at the generated assembly and check whether these instructions are used. — n. m. could be an AI, Aug 23 '12 at 15:31
How can vector be faster when there's an additional level of indirection compared with an std::array?? And there's no way its faster using iterators. — user997112, Mar 15 '15 at 17:34

score 3 · Accepted Answer · answered Aug 23 '12 at 15:14

3

There is no black magic here, it is just too easy for the compiler to see that here

for( int n=0; n<ncycle; ++n )
   for( int i=0; i<sz; ++i )
     for( int j=0; j<sz; ++j )
       for( int k=0; k<sz; ++k )
          a[i][j][k] *= temp;

everything is known at compile time. It can easily unroll the loop to speed it up.

answered Aug 23 '12 at 15:14

Bo Persson

90,663
31
146
203

Wouldn't the same be true for the vector loop? – Ethereal Aug 23 '12 at 15:23
@ethereal. A `std::vector` is a template, which makes it not quite as easy. – Xeo Aug 23 '12 at 15:25
Any optimization happens after template instantiation. The optimizer sees code not very much unlike your example. – n. m. could be an AI Aug 23 '12 at 15:38
It's much harder to see that 3 levels of iterators are also constant. With MSVC I can get a 25% speedup using `for (vector::iterator it3 =it2->begin(), end = it2->end(); it3 != end; ++it3)`, so it's sensitive. – Bo Persson Aug 23 '12 at 16:48
@Bo, accepted your answer; replacing temp with rand() prevented compiler optimization and gave a more reasonable comparison. – Ethereal Aug 24 '12 at 14:41

score 1 · Answer 2 · answered Apr 30 '13 at 07:03

The array is just a pointer to an area of memory. You cannot get more direct than that.

A vector needs to call inline functions to do the same thing. Which may or not be really inlined. as inline and __forceinline are just guidances for the compiler to inline your code. The compiler in turn is free to do whatever it wants.

Also, iterator is not necessarily a pointer. The compiler, is calling an inline function which may or not be inlined. (again inline is a guidance to the compiler, not a rule).

When in doubt. Compile to assembly and see the resulting code. If the compiler is doing its job, there should be no noticeable difference. If however, if the compiler is not inlining what is supposed to inline, you will get a huge performance penalty compiler to using vector instead of an array.

score 0 · Answer 3 · answered Apr 30 '13 at 10:06

All elements in an nested array are in contiguous memory locations, so when the compiler encounters an expression of the form a[x][y][z] it generates code which calculates the real index, which involves just integer multiplications and additions. An nested std::vector, on the other hand, is really nested. To expression v[x][y][z], involves two more levels of indirection, because at v[x] there is really a std::vector object which contains a pointer to an array of vectors which in turn contain the real data.

Intrinsic array access is much faster than std::vector access -- Black Magic?

3 Answers3

Linked