I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to the MKL or ACML depending on the platform. A lot of these matrix calculations operate on different independent matrices and hence I use std::thread's to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy Blas routine. It seems that if two threads are using this routine in parallel each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running parallel also doubles the amount of time each threads needs, so no speed-up is gained. However, using the Intel C++ Compiler this problems vanishes: with increasing number of threads the time a single thread needs is constant.
However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
const std::size_t N { 936 };
std::vector <std::size_t> times(num_threads, 0);
auto threadfunction = [&](std::size_t tid)
{
const std::size_t dim { N * N };
double* pA = new double[dim];
double* pB = new double[dim];
for (std::size_t i {0}; i < N; ++i){
pA[i] = i;
pB[i] = 2*i;
}
boost::posix_time::ptime now1 =
boost::posix_time::microsec_clock::universal_time();
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
boost::posix_time::ptime now2 =
boost::posix_time::microsec_clock::universal_time();
boost::posix_time::time_duration dur = now2 - now1;
times[tid] += dur.total_milliseconds();
delete[] pA;
delete[] pB;
};
std::vector <std::thread> mythreads;
// start threads
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads.emplace_back(threadfunction, n);
}
// wait for threads to finish
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads[n].join();
std::cout << " Thread " << n+1 << " of " << num_threads
<< " took " << times[n]<< "msec" << std::endl;
}
}
}
void simplesum(double* a, double* b, std::size_t dim){
for(std::size_t i{0}; i < dim; ++i)
{*(++a) += *(++b);}
}
The outout with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The outout with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with the icc the time needed for one thread perform the computations is constant (as I would have expected; my CPU has 4 physical cores) and with the gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the above stated gcc results.