We are currently writing some performance critical code in C++ that operates on many large matrices and vectors. Regarding to our research, there should be no major performance difference between std::array
and standard C arrays
(See This question or this).
However, while testing we experienced a huge performance improvement by using C arrays over std::array
.
This is our demo code:
#include <iostream>
#include <array>
#include <sys/time.h>
#define ROWS 784
#define COLS 100
#define RUNS 50
using std::array;
void DotPComplex(array<double, ROWS> &result, array<double, ROWS> &vec1, array<double, ROWS> &vec2){
for(int i = 0; i < ROWS; i++){
result[i] = vec1[i] * vec2[i];
}
}
void DotPSimple(double result[ROWS], double vec1[ROWS], double vec2[ROWS]){
for(int i = 0; i < ROWS; i++){
result[i] = vec1[i] * vec2[i];
}
}
void MatMultComplex(array<double, ROWS> &result, array<array<double, COLS>, ROWS> &mat, array<double, ROWS> &vec){
for (int i = 0; i < COLS; ++i) {
for (int j = 0; j < ROWS; ++j) {
result[i] += mat[i][j] * vec[j];
}
}
}
void MatMultSimple(double result[ROWS], double mat[ROWS][COLS], double vec[ROWS]){
for (int i = 0; i < COLS; ++i) {
for (int j = 0; j < ROWS; ++j) {
result[i] += mat[i][j] * vec[j];
}
}
}
double getTime(){
struct timeval currentTime;
gettimeofday(¤tTime, NULL);
double tmp = (double)currentTime.tv_sec * 1000.0 + (double)currentTime.tv_usec/1000.0;
return tmp;
}
array<double, ROWS> inputVectorComplex = {{ 0 }};
array<double, ROWS> resultVectorComplex = {{ 0 }};
double inputVectorSimple[ROWS] = { 0 };
double resultVectorSimple[ROWS] = { 0 };
array<array<double, COLS>, ROWS> inputMatrixComplex = {{0}};
double inputMatrixSimple[ROWS][COLS] = { 0 };
int main(){
double start;
std::cout << "DotP test with C array: " << std::endl;
start = getTime();
for(int i = 0; i < RUNS; i++){
DotPSimple(resultVectorSimple, inputVectorSimple, inputVectorSimple);
}
std::cout << "Duration: " << getTime() - start << std::endl;
std::cout << "DotP test with C++ array: " << std::endl;
start = getTime();
for(int i = 0; i < RUNS; i++){
DotPComplex(resultVectorComplex, inputVectorComplex, inputVectorComplex);
}
std::cout << "Duration: " << getTime() - start << std::endl;
std::cout << "MatMult test with C array : " << std::endl;
start = getTime();
for(int i = 0; i < RUNS; i++){
MatMultSimple(resultVectorSimple, inputMatrixSimple, inputVectorSimple);
}
std::cout << "Duration: " << getTime() - start << std::endl;
std::cout << "MatMult test with C++ array: " << std::endl;
start = getTime();
for(int i = 0; i < RUNS; i++){
MatMultComplex(resultVectorComplex, inputMatrixComplex, inputVectorComplex);
}
std::cout << "Duration: " << getTime() - start << std::endl;
}
Compiled with: icpc demo.cpp -std=c++11 -O0
This is the result:
DotP test with C array:
Duration: 0.289795 ms
DotP test with C++ array:
Duration: 1.98413 ms
MatMult test with C array :
Duration: 28.3459 ms
MatMult test with C++ array:
Duration: 175.15 ms
With -O3
flag:
DotP test with C array:
Duration: 0.0280762 ms
DotP test with C++ array:
Duration: 0.0288086 ms
MatMult test with C array :
Duration: 1.78296 ms
MatMult test with C++ array:
Duration: 4.90991 ms
The C array implementation is much faster without compiler optimization. Why?
Using compiler optimizations the dot products are equally fast. But for the matrix multiplication there is still a significant speedup when using C arrays.
Is there a way to achieve equal performance when using std::array
?
Update:
The compiler used: icpc 17.0.0
With gcc 4.8.5
our code runs much slower than with the intel compiler using any optimization level. Therefore we are mainly interested in the behaviour of the intel compiler.
As suggested by Jonas we adjusted RUNS 50.000
with the following results (intel compiler):
With -O0
flag:
DotP test with C array:
Duration: 201.764 ms
DotP test with C++ array:
Duration: 1020.67 ms
MatMult test with C array :
Duration: 15069.2 ms
MatMult test with C++ array:
Duration: 123826 ms
With -O3
flag:
DotP test with C array:
Duration: 16.583 ms
DotP test with C++ array:
Duration: 15.635 ms
MatMult test with C array :
Duration: 980.582 ms
MatMult test with C++ array:
Duration: 2344.46 ms