The performance difference between C++ vectors and plain arrays has been extensively discussed, for example here and here. These discussions usually conclude that vectors and arrays perform similarly when accessed with the [] operator and the compiler is allowed to inline functions. That is what I expected, but I came across a case where it does not seem to hold. The code below is quite simple: it takes a 3D volume and sweeps over it, applying a small 3D mask a certain number of times. Depending on the VERSION macro, the volumes are declared as vectors and accessed through at() (VERSION=2), declared as vectors and accessed via [] (VERSION=1), or declared as plain arrays (VERSION=0).

#include <vector>
#define NX 100
#define NY 100
#define NZ 100
#define H  1
#define C0 1.5f
#define C1 0.25f
#define T 3000

#if !defined(VERSION) || VERSION > 2 || VERSION < 0 
  #error "Bad version"
#endif 

#if VERSION == 2
  #define AT(_a_,_b_) (_a_.at(_b_))
  typedef std::vector<float> Field;
#endif 

#if VERSION == 1
  #define AT(_a_,_b_) (_a_[_b_])
  typedef std::vector<float> Field;
#endif 

#if VERSION == 0
  #define AT(_a_,_b_) (_a_[_b_])
  typedef float* Field;
#endif 

#include <iostream>
#include <omp.h>

int main(void) {

#if VERSION != 0
  Field img(NX*NY*NZ);
#else
  Field img = new float[NX*NY*NZ];
#endif


  double end, begin;
  begin = omp_get_wtime();  

  const int csize = NZ;
  const int psize = NZ * NX;
  for(int t  = 0; t < T; t++ ) {

    /* Sweep the 3D volume and apply the "blurring" coefficients */
    #pragma omp parallel for
    for(int j = H; j < NY-H; j++ ) { 
      for( int i = H; i < NX-H; i++ ) {
        for( int k = H; k < NZ-H; k++ ) {
          int eindex = k+i*NZ+j*NX*NZ;
          AT(img,eindex) = C0 * AT(img,eindex) +
              C1 * (AT(img,eindex - csize) +
                    AT(img,eindex + csize) + 
                    AT(img,eindex - psize) + 
                    AT(img,eindex + psize) );
        }
      }
    }
  }

  end = omp_get_wtime();
  std::cout << "Elapsed "<< (end-begin) <<" s." << std::endl;

 /* Access img after timing so it cannot be freed before the timing ends */
 #define WHATEVER 12.f
 if( img[ NZ ] == WHATEVER ) { 
   std::cout << "Whatever" << std::endl;
 }


#if VERSION == 0
  delete[] img;
#endif 

}

One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:

  • VERSION 2 : Elapsed 6.94905 s.
  • VERSION 1 : Elapsed 4.08626 s.
  • VERSION 0 : Elapsed 1.97576 s.

If I compile without OMP (I've got only two cores), I get similar results:

  • VERSION 2 : Elapsed 10.9895 s.
  • VERSION 1 : Elapsed 7.14674 s.
  • VERSION 0 : Elapsed 3.25336 s.

I always compile with GCC 4.6.3 and the options -fopenmp -finline-functions -O3 (removing -fopenmp, of course, when I compile without OpenMP). Am I doing something wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?

PS: I cannot use std::array because the compiler I depend on doesn't support the C++11 standard. With ICC 13.1.2 I get similar behavior.

Genís
  • What if you don't use omp? Are you sure it's even legal? – Luchian Grigore Jan 21 '14 at 09:21
  • Good point. Got similar results without omp: 10.9895, 7.14674 and 3.25336 (version 2,1,0 resp.). I'm including it on the question. – Genís Jan 21 '14 at 09:24
  • ideone gives the same results for versions 0 and 1. – Luchian Grigore Jan 21 '14 at 09:29
  • Can you try another compiler or compiler version? GCC 4.1.2 (don't have anything newer available right now) generates different code for 0 and 1, so maybe the optimizer has trouble with the inlined vector functions. Might be fixed in newer GCC, or in Clang/MSVC/ICC. – oliver Jan 21 '14 at 09:37
  • I tried with ICC and I got similar behavior. I also added some lines between the timing and the deallocation so that the data is forcibly freed after the timing ends. That didn't seem to make any difference. – Genís Jan 21 '14 at 09:40
  • I don't understand your comment. I may have some error, though I don't think so, but I am not concerned about out-of-range accesses. It's a matter of performance. – Genís Jan 21 '14 at 09:46
  • @Genís you should be worried, there's no point in talking about performance if your program exhibits undefined behavior. – Luchian Grigore Jan 21 '14 at 09:51
  • hehe but it does not exhibit that. My goal is to define several debug levels: at one level fields are accessed through `at`, so I care about possible out-of-range errors, and at another level, once the code has been properly debugged, fields are accessed with the [] operator. – Genís Jan 21 '14 at 09:55
  • Strange. I can't duplicate this on my machine (a Windows box). With both Visual (2012) and g++ (4.7.2), version 1 is the fastest (although the difference between it and version 0 is not really significant---less than 1%). Version 2 is about 60% slower, which means that the compilers aren't succeeding in hoisting the bounds checking outside the loop, which one would normally expect. (For what it's worth: Visual is about 4% faster than g++, regardless of the version.) – James Kanze Jan 21 '14 at 10:34
  • @molbdnilo I've not verified anything manually, but the fact that he doesn't get an exception in version 2 pretty much proves that there's no out of bounds access. – James Kanze Jan 21 '14 at 10:35
  • @JamesKanze I forgot to take `-H` into account, and the fact that you just mentioned. – molbdnilo Jan 21 '14 at 12:28
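
On the legality question raised in the comments: with the OpenMP pragma, the in-place update lets one thread read plane j-1 or j+1 while another thread is writing it, which is a data race. Below is a minimal sketch of one way to avoid that, assuming two buffers so that each sweep reads only from one and writes only to the other. The names `sweep` and `run` are hypothetical and not part of the question's code, and the numerical result differs from the in-place version because every read sees the previous step's values.

```cpp
#include <vector>

// Hypothetical helper (not from the question): one sweep that reads only from
// `src` and writes only to `dst`, so iterations over j are independent.
void sweep(const std::vector<float>& src, std::vector<float>& dst,
           int nx, int ny, int nz, int h, float c0, float c1) {
  const int csize = nz;        // stride between neighbouring i rows
  const int psize = nz * nx;   // stride between neighbouring j planes
#ifdef _OPENMP
  #pragma omp parallel for
#endif
  for (int j = h; j < ny - h; j++) {
    for (int i = h; i < nx - h; i++) {
      for (int k = h; k < nz - h; k++) {
        const int e = k + i * nz + j * nx * nz;
        dst[e] = c0 * src[e] +
                 c1 * (src[e - csize] + src[e + csize] +
                       src[e - psize] + src[e + psize]);
      }
    }
  }
}

// Runs `steps` sweeps, swapping the two buffers after each one.
// Both buffers should start as copies of the same volume, because the
// border cells are never written by sweep().
void run(std::vector<float>& a, std::vector<float>& b,
         int nx, int ny, int nz, int h, float c0, float c1, int steps) {
  for (int t = 0; t < steps; t++) {
    sweep(a, b, nx, ny, nz, h, c0, c1);
    a.swap(b);  // constant-time exchange of the underlying storage
  }
}
```

After `run` returns, the latest values are in the first buffer. The extra buffer doubles the memory footprint, so this is a trade-off rather than a drop-in replacement.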

1 Answer

I tried your code, using chrono to measure the time.
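
For reference, timing with chrono instead of omp_get_wtime could look roughly like this. It is only a sketch: the helper name `elapsed_seconds` is hypothetical, and it needs C++11, which the compiler used in the question does not support.

```cpp
#include <chrono>

// Hypothetical helper: returns the wall-clock seconds spent in `work()`.
// steady_clock is used because it never jumps backwards.
template <typename F>
double elapsed_seconds(F work) {
  const std::chrono::steady_clock::time_point begin =
      std::chrono::steady_clock::now();
  work();
  const std::chrono::steady_clock::time_point end =
      std::chrono::steady_clock::now();
  // duration<double> converts the tick count to fractional seconds.
  return std::chrono::duration<double>(end - begin).count();
}
```

The timed loop nest would be passed in as a lambda or function object, and the returned value printed in place of `end-begin`.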

I compiled with clang (version 3.5) and libc++:

clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3

The result is the same for VERSION 0 and VERSION 1; there's no big difference. Both take 3.4 seconds on average (I am using a virtual machine, so it is slower).

Then I tried g++ (version 4.8.1):

g++ test.cc -std=c++1y -finline-functions -O3

The result is roughly 4.4 seconds for VERSION 0 and roughly 5.2 seconds for VERSION 1.

I then tried clang++ with libstdc++:

clang++ test.cc -std=c++11 -finline-functions -O3

Voilà, the result is back to 3.4 seconds again.

So it's purely an optimization "bug" in g++.
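
If the gap really comes from how the optimizer handles the vector's inlined accessors, one workaround worth trying is to take a raw pointer to the vector's storage once and index through that inside the loops. This is a sketch under that assumption, not something measured here; the function name `sweep_in_place` is hypothetical, and whether it closes the gap on a given compiler would have to be timed.

```cpp
#include <vector>

// Hypothetical variant of the question's loop nest: the storage is still
// owned by a std::vector, but the hot loop indexes a plain float*.
void sweep_in_place(std::vector<float>& img,
                    int nx, int ny, int nz, int h, float c0, float c1) {
  if (img.empty()) return;
  float* const p = &img[0];    // C++03-compatible; img.data() in C++11
  const int csize = nz;        // stride between neighbouring i rows
  const int psize = nz * nx;   // stride between neighbouring j planes
  for (int j = h; j < ny - h; j++) {
    for (int i = h; i < nx - h; i++) {
      for (int k = h; k < nz - h; k++) {
        const int e = k + i * nz + j * nx * nz;
        p[e] = c0 * p[e] + c1 * (p[e - csize] + p[e + csize] +
                                 p[e - psize] + p[e + psize]);
      }
    }
  }
}
```

The pointer stays valid only while the vector is not resized, which holds here because the volume has a fixed size.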

user534498
  • Well I would say the 18% between 4.4 and 5.2 sec. is still important, but obviously not as important as what I got. I also tried with icc and got the same results, but I am installing clang and checking it. If I get what you got, do you think that should be reported to gcc (and icc) people? – Genís Jan 21 '14 at 09:53
  • I think so, because purely reading vectors with the [] operator should have the same performance as an array. I think gcc should have optimized that; maybe it just can't optimize something like (AT(img,eindex - csize) + AT(img,eindex + csize) + AT(img,eindex - psize) + ...). You should also notice that, for your code, clang performs better than gcc (3.4 versus 4.4/5.2). I am not sure how that's achieved for such a simple loop. – user534498 Jan 21 '14 at 10:00
  • @Genís By the way, I used GCC 4.8.1 in my test; that is probably why the gap is smaller than in your test. – user534498 Jan 21 '14 at 10:03
  • I compiled with clang (without openmp) and I got the following results: 7.2, 4.8 and 4.8. Certainly, versions 1 and 0 perform the same, but I would say they perform one as bad as the other. I think the problem is that whilst ICC and GCC do optimize version 0 clang is not capable of that, so that wouldn't answer my question. By the way, I might be wrong because I don't master it, but doesn't clang use gcc or whichever compiler for actually compiling? – Genís Jan 21 '14 at 10:06
  • clang is a completely new compiler written from scratch. I used clang version 3.5 (I built it myself from the latest snapshot 1 month ago) and gcc version 4.8.1. There might be differences between different compilers. Your clang gives 7.2, 4.8, 4.8 and your gcc gives 10.9, 7.1, 3.2; it's still quite questionable why gcc shows such a large gap between 7.1 and 3.2. – user534498 Jan 21 '14 at 10:13
  • Oops, I just noticed your comment about gcc 4.8.1. That may be the point. Regrettably I have no easy way to test it with that compiler. I may compile it and try.. :S – Genís Jan 21 '14 at 10:14
  • With gcc-4.9, I get 2.5 (version 0) and 2.7 (version 1) versus 3.1 for clang-3.5 (version 0 or 1) and more than 4 with gcc-4.8. – Marc Glisse Jan 21 '14 at 10:36
  • There's no reason for version 2 to be any slower than the others. The compiler can easily determine the upper and lower limits for the values involved in bounds checking, and so eliminate it entirely. (This is regularly done in compilers for languages which require bounds checking.) – James Kanze Jan 21 '14 at 10:40
  • Indeed, eliminating the range check should be pretty easy, it seems to fail because of http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58742 (and then that prevents vectorization and makes the code 4 times slower). – Marc Glisse Jan 21 '14 at 11:58
  • With gcc 4.8.2 the difference between versions 0 and 1 is reduced, but version 0 still performs better than version 1 (~2.0 versus ~2.5 sec.). Better, but still noticeable. I guess that should improve in future versions of the compilers, as @MarcGlisse noticed with gcc 4.9. – Genís Jan 21 '14 at 12:59
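
The range-check elimination discussed in the comments can also be done by hand when the compiler does not manage it. Here is a minimal sketch, assuming the same index layout as the question: the smallest and largest indices the stencil can touch are checked once, and the loops then use unchecked []. The function name `checked_sweep` is hypothetical.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: one up-front range check replaces the per-access
// check that at() performs inside the loops.
void checked_sweep(std::vector<float>& img,
                   int nx, int ny, int nz, int h, float c0, float c1) {
  const int csize = nz;        // stride between neighbouring i rows
  const int psize = nz * nx;   // stride between neighbouring j planes
  if (nx - h <= h || ny - h <= h || nz - h <= h) return;  // nothing to do

  // Smallest access: first cell (k=i=j=h) minus one plane.
  const long lo = static_cast<long>(h) + static_cast<long>(h) * nz +
                  static_cast<long>(h - 1) * psize;
  // Largest access: last cell (k=nz-h-1, i=nx-h-1, j=ny-h-1) plus one plane.
  const long hi = static_cast<long>(nz - h - 1) +
                  static_cast<long>(nx - h - 1) * nz +
                  static_cast<long>(ny - h) * psize;
  if (lo < 0 || hi >= static_cast<long>(img.size()))
    throw std::out_of_range("stencil would leave the volume");

  for (int j = h; j < ny - h; j++) {
    for (int i = h; i < nx - h; i++) {
      for (int k = h; k < nz - h; k++) {
        const int e = k + i * nz + j * nx * nz;
        img[e] = c0 * img[e] + c1 * (img[e - csize] + img[e + csize] +
                                     img[e - psize] + img[e + psize]);
      }
    }
  }
}
```

This keeps the safety of the `at` debug level for this particular loop nest, at the cost of writing the bounds reasoning out manually.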