16

We are currently writing some performance-critical code in C++ that operates on many large matrices and vectors. According to our research, there should be no major performance difference between std::array and plain C arrays (see this question or this one). However, while testing we observed a huge performance improvement when using C arrays instead of std::array. This is our demo code:

#include <iostream>
#include <array>
#include <sys/time.h>

#define ROWS 784
#define COLS 100
#define RUNS 50

using std::array;

void DotPComplex(array<double, ROWS> &result, array<double, ROWS> &vec1, array<double, ROWS> &vec2){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

void DotPSimple(double result[ROWS], double vec1[ROWS], double vec2[ROWS]){
  for(int i = 0; i < ROWS; i++){
    result[i] = vec1[i] * vec2[i];
  }
}

void MatMultComplex(array<double, ROWS> &result, array<array<double, COLS>, ROWS> &mat, array<double, ROWS> &vec){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}

void MatMultSimple(double result[ROWS], double mat[ROWS][COLS], double vec[ROWS]){
  for (int i = 0; i < COLS; ++i) {
      for (int j = 0; j < ROWS; ++j) {
        result[i] += mat[i][j] * vec[j];
      }
  }
}

double getTime(){
    struct timeval currentTime;
    gettimeofday(&currentTime, NULL);
    double tmp = (double)currentTime.tv_sec * 1000.0 + (double)currentTime.tv_usec/1000.0;
    return tmp;
}

array<double, ROWS> inputVectorComplex = {{ 0 }};
array<double, ROWS> resultVectorComplex = {{ 0 }};
double inputVectorSimple[ROWS] = { 0 };
double resultVectorSimple[ROWS] = { 0 };

array<array<double, COLS>, ROWS> inputMatrixComplex = {{0}};
double inputMatrixSimple[ROWS][COLS] = { 0 };

int main(){
  double start;
  std::cout << "DotP test with C array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "DotP test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    DotPComplex(resultVectorComplex, inputVectorComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C array : " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultSimple(resultVectorSimple, inputMatrixSimple, inputVectorSimple);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;

  std::cout << "MatMult test with C++ array: " << std::endl;
  start = getTime();
  for(int i = 0; i < RUNS; i++){
    MatMultComplex(resultVectorComplex, inputMatrixComplex, inputVectorComplex);
  }
  std::cout << "Duration: " << getTime() - start << std::endl;
}

Compiled with:

icpc demo.cpp -std=c++11 -O0

This is the result:

DotP test with C array: 
Duration: 0.289795 ms
DotP test with C++ array: 
Duration: 1.98413 ms
MatMult test with C array : 
Duration: 28.3459 ms
MatMult test with C++ array: 
Duration: 175.15 ms

With -O3 flag:

DotP test with C array: 
Duration: 0.0280762 ms
DotP test with C++ array: 
Duration: 0.0288086 ms
MatMult test with C array : 
Duration: 1.78296 ms
MatMult test with C++ array: 
Duration: 4.90991 ms

The C array implementation is much faster without compiler optimization. Why? With compiler optimizations enabled, the dot products are equally fast, but for the matrix multiplication there is still a significant speedup when using C arrays. Is there a way to achieve equal performance with std::array?


Update:

The compiler used: icpc 17.0.0

With gcc 4.8.5 our code runs much slower than with the Intel compiler at any optimization level. Therefore we are mainly interested in the behaviour of the Intel compiler.

As suggested by Jonas, we increased RUNS to 50,000, with the following results (Intel compiler):

With -O0 flag:

DotP test with C array: 
Duration: 201.764 ms
DotP test with C++ array: 
Duration: 1020.67 ms
MatMult test with C array : 
Duration: 15069.2 ms
MatMult test with C++ array: 
Duration: 123826 ms

With -O3 flag:

DotP test with C array: 
Duration: 16.583 ms
DotP test with C++ array: 
Duration: 15.635 ms
MatMult test with C array : 
Duration: 980.582 ms
MatMult test with C++ array: 
Duration: 2344.46 ms
The Floe
  • With GCC 6.3.1 I get 0.0100098 vs 0.0100098 for DotP and 3.08594 vs 3.11499 for MatMult. I guess GCC is just better at optimising things here? – Uli Schlachter Apr 03 '17 at 11:09
  • 5
    Also, your synthetic benchmark is not really useful. You are mostly measuring whether the compiler can figure out that the result will be unused (in which case all the computations can be left out) or constant (in which case the compiler could just do all the computations at compile time). In both cases, you aren't measuring anything useful. – Uli Schlachter Apr 03 '17 at 11:11
  • With gcc 4.8.5 we get similar results but icpc is faster in general. – The Floe Apr 03 '17 at 11:13
  • @UliSchlachter That explains some of the optimised results but not the matrix multiplication or the unoptimised results. – JeremyP Apr 03 '17 at 11:13
  • 7
    Because you are paying an abstraction penalty because of function calls which aren't being inlined because you've _disabled optimisation_? – davmac Apr 03 '17 at 11:15
  • 8
    Enable optimisation if you want the compiler to optimise away abstractions. – Uli Schlachter Apr 03 '17 at 11:16
  • 4
    On https://godbolt.org/g/9MnTLs, both `MatMultComplex` and `MatMultSimple` appear to produce identical assembly – eerorika Apr 03 '17 at 11:16
  • Add one more test `MatMultCompromise` where `mat` parameter is `array&`, and see what happens. – Dialecticus Apr 03 '17 at 11:16
  • 5
    This isn't a C question, so I've removed the C tag. – JeremyP Apr 03 '17 at 11:18
  • 2
    Does the timing change if you change the order of the tests? – Useless Apr 03 '17 at 11:21
  • FWIW I see equivalent times in `MatMult{Complex,Simple}` on my desktop with GCC 6.3. And equivalent - but much better - times with LLVM 3.9.1 also. – davmac Apr 03 '17 at 11:24
  • 2
    With optimization -O3 and gcc 5.8, C++ array is faster than C array for both DotP and MatMult. – XZ6H Apr 03 '17 at 11:32
  • @UliSchlachter I figured that you are right: The unused results produced the difference for O3. – The Floe Apr 03 '17 at 12:43

2 Answers

22

First up, the number of runs you are using is simply too few. Personally, I did not realize (before running the code) that your "Duration" measurements are in milliseconds.

By increasing RUNS to 5,000,000 for DotPSimple and DotPComplex, the timings are something like:

DotP test with C array:
Duration: 1074.89
DotP test with C++ array:
Duration: 1085.34

That is, they are very close to equally fast. In fact, which of the two was faster differed from test to test, due to the stochastic nature of a benchmark. The same was true for MatMultSimple and MatMultComplex, though they only needed 50,000 runs.

If you really want to measure and learn more, you should embrace the stochastic nature of this benchmark and approximate the distributions of the "Duration" measurements, including running the functions in random order to remove any ordering bias.
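As a rough sketch of what that could look like (this is not part of the original benchmark; the wrapper names are hypothetical, and it reuses the functions, globals, getTime() and RUNS from the question while replacing its main()), one can repeat the measurement many times, shuffle the order of the two variants on every repetition, and consume the results so the optimizer cannot discard the computations:

#include <algorithm>
#include <iterator>
#include <random>

// Hypothetical wrappers around the two dot-product loops from the question.
void BenchSimple(){
  for(int i = 0; i < RUNS; i++)
    DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple);
}

void BenchComplex(){
  for(int i = 0; i < RUNS; i++)
    DotPComplex(resultVectorComplex, inputVectorComplex, inputVectorComplex);
}

int main(){
  std::mt19937 rng{std::random_device{}()};
  void (*tests[])() = { BenchSimple, BenchComplex };
  const char *names[] = { "C array", "C++ array" };

  for(int rep = 0; rep < 30; rep++){                       // collect a distribution, not a single number
    int order[] = { 0, 1 };
    std::shuffle(std::begin(order), std::end(order), rng); // random order removes ordering bias
    for(int k : order){
      double start = getTime();
      tests[k]();
      std::cout << names[k] << ": " << getTime() - start << std::endl;
    }
  }
  // Use the results so the compiler cannot optimize the computations away entirely.
  std::cout << resultVectorSimple[0] + resultVectorComplex[0] << std::endl;
}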

EDIT: The assembly code (from user2079303's answer) proves outright that there are no differences with optimization enabled. Thus, the zero-cost abstractions are in fact zero-cost when optimization is enabled, which is a reasonable requirement.

UPDATE:

The compiler I used:

g++ (Debian 6.3.0-6) 6.3.0 20170205

With the following command:

g++ -Wall -Wextra -pedantic -O3 test.cpp

Using this processor:

Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
Jonas
12

why ... is much faster without compiler optimization. Why?

For whatever reason the compiler chooses. If you don't let the compiler optimize, then you cannot expect two different pieces of code to have similar performance, even if they have identical behaviour. When optimization is enabled, the compiler may be able to transform the abstracted code into the efficient one, and the performance should be comparable.

The use of std::array involves function calls where the use of a pointer does not. For example, std::array::operator[] is a function, while the subscript operator of a pointer is not. Making a function call is potentially slower than not making a function call. All those function calls can be optimized away (expanded inline), but if you choose to not enable optimization, then the function calls remain.
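As a minimal illustration of this point (a sketch, not taken from the question's code; the function names are illustrative), consider reading one element through each kind of array. At -O0 the std::array access remains a genuine call to operator[], while the C-array access is just pointer arithmetic and a load; with optimization enabled, both typically compile to the same instructions.

#include <array>

// std::array: element access goes through the member function operator[],
// which is a real (non-inlined) call when optimization is disabled.
double ReadComplex(const std::array<double, 784> &v, int i){
  return v[i];
}

// C array: the parameter decays to a pointer and the subscript is plain
// address arithmetic -- there is no function to call in the first place.
double ReadSimple(const double v[784], int i){
  return v[i];
}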

But for the matrix multiplication there is still a significant speedup when using C arrays.

Probably a quirk in your benchmark, or in the compiler. Here both functions have identical assembly, and so have identical performance.

EDIT: I concur with Jonas' answer. The benchmark has way too few iterations. Also, it is not possible to say whether the difference between two measurements is significant without repeating the benchmark, and analysing the deviation.
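For example (a sketch of the idea, not something from the answer; the helper names are hypothetical), the repeated "Duration" samples of each variant can be summarized by their mean and sample standard deviation before deciding whether the two variants really differ:

#include <cmath>
#include <vector>

// Mean of a set of duration samples (in milliseconds).
double Mean(const std::vector<double> &samples){
  double sum = 0.0;
  for(double s : samples) sum += s;
  return sum / samples.size();
}

// Sample standard deviation: how much individual runs scatter around the mean.
double StdDev(const std::vector<double> &samples){
  double m = Mean(samples), sq = 0.0;
  for(double s : samples) sq += (s - m) * (s - m);
  return std::sqrt(sq / (samples.size() - 1));
}

If the difference between the two means is small compared to these deviations, the benchmark does not support calling one variant faster than the other.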


The conclusions are:

  • The C array is not faster than std::array when optimization is enabled. At least not when compiling with clang 3.9.1 as demonstrated by the link. Perhaps your compiler produces different assembly, but I see no reason why it should.

  • The zero cost abstractions of C++ are zero cost only after optimization.

  • Writing meaningful micro benchmarks is not trivial.

eerorika
  • 1
    The C version of the matrix is not the same as the std::array version. The C array is simply a block of ROWS * COLS doubles. The std::array version is, at least conceptually an array of arrays. That means to fully dereference it requires two subscripting operations whereas the C version is one calculation. I am impressed that some modern C++ compilers are able to optimise the abstraction overhead away almost completely. – JeremyP Apr 03 '17 at 11:32
  • 2
    @JeremyP surely a 2D array is also conceptually an array of arrays. The only difference that I see is the fact that the two dereferences of `std::array` are within separate (inline) function calls, so it might be sensitive to order of optimization passes. The functions must be expanded before the dereferences can be merged, while the two dereferences of the pointer are there even before inlining. Frankly, I would have been disappointed if the assembly output wasn't (at least almost) identical. – eerorika Apr 03 '17 at 11:40
  • @user2079303 I partially agree with your answer. Increasing RUNS doesn't change the behavior with our compiler; the ratio remains the same. I figured that there is a "quirk in the benchmark" after following Uli Schlachter's comment: if the results from the functions are actually used after calculating, both implementations perform approximately equally. – The Floe Apr 03 '17 at 12:27
  • 1
    @TheFloe I recommend to compare the assembly output given by your compiler and see how they differ. – eerorika Apr 03 '17 at 12:29