I have what seems to be a very simple parallel for
loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I thought this was due to some cache thrashing, so I played around with schedules, chunk size, __restrict__
, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.
This should obviously be very simple and should speed up nearly linearly. What am I missing here?
Full code:
#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
void tic(), toc();
int main(int argc, const char *argv[])
{
	const int COUNT = 100;
	const size_t sz = 250000 * 200;
	std::vector<int> vec(sz, 1);

	std::cout << "max threads: " << omp_get_max_threads() << std::endl;

	std::cout << "serial reduction" << std::endl;
	tic();
	for(int c = 0; c < COUNT; ++ c) {
		double sum = 0;
		for(size_t i = 0; i < sz; ++ i)
			sum += vec[i];
	}
	toc();

	int *const ptr = vec.data();
	const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

	std::cout << "parallel reduction" << std::endl;
	tic();
	for(int c = 0; c < COUNT; ++ c) {
		double sum = 0;
		#pragma omp parallel for default(none) reduction(+:sum)
		for(int i = 0; i < sz_i; ++ i)
			sum += ptr[i];
	}
	toc();

	std::cout << "serial memset" << std::endl;
	tic();
	for(int c = 0; c < COUNT; ++ c) {
		for(size_t i = 0; i < sz; ++ i)
			vec[i] = 0;
	}
	toc();

	std::cout << "parallel memset" << std::endl;
	tic();
	for(int c = 0; c < COUNT; ++ c) {
		#pragma omp parallel for default(none)
		for(int i = 0; i < sz_i; ++ i)
			ptr[i] = 0;
	}
	toc();

	return 0;
}
static clock_t ClockCounter;
void tic()
{
	ClockCounter = std::clock();
}

void toc()
{
	ClockCounter = std::clock() - ClockCounter;
	std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
Running this yields:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1790000
parallel reduction
elapsed clock ticks: 19690000
serial memset
elapsed clock ticks: 3860000
parallel memset
elapsed clock ticks: 20800000
If I run with -O2
, g++ is able to optimize the serial reduction away entirely and I get zero time, hence -O1
(a sketch of one way to keep the computation alive under -O2 follows after the next set of results). Also, calling omp_set_num_threads(1);
makes the times more similar, although there are still some differences:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
elapsed clock ticks: 1770000
parallel reduction
elapsed clock ticks: 7370000
serial memset
elapsed clock ticks: 2290000
parallel memset
elapsed clock ticks: 3550000
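As an aside, a sketch of one way to keep the serial reduction alive under -O2 (not what I actually ran) would be to make the result observable:

// hypothetical tweak: feed the per-pass sums into a value that is printed,
// so the compiler cannot discard the serial loop
double total = 0;
for(int c = 0; c < COUNT; ++ c) {
	double sum = 0;
	for(size_t i = 0; i < sz; ++ i)
		sum += vec[i];
	total += sum;
}
std::cout << "total: " << total << std::endl; // using the result keeps the work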
The explanation should be fairly obvious; I feel like I'm missing something very elementary. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We're both running Linux.
EDIT
It seems that one error was on the timing side: std::clock() reports CPU time accumulated over all threads, not elapsed wall time. Running with:
static double ClockCounter;
void tic()
{
	ClockCounter = omp_get_wtime(); //std::clock();
}

void toc()
{
	ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
	std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
yields more "reasonable" times:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 1.80974
parallel reduction
elapsed clock ticks: 2.07367
serial memset
elapsed clock ticks: 2.37713
parallel memset
elapsed clock ticks: 2.23609
But still, there is no speedup; it is just no longer slower.
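For reference, here is a minimal standalone sketch (the toy loop and sizes are made up, not taken from the benchmark above) that makes the timer difference visible: on a CPU-bound parallel region the std::clock() figure grows with the number of threads, while omp_get_wtime() does not.

#include <omp.h>
#include <ctime>
#include <iostream>

int main()
{
	const std::clock_t cpu0 = std::clock();
	const double wall0 = omp_get_wtime();

	double sum = 0;
	#pragma omp parallel for reduction(+:sum)
	for(int i = 0; i < 100000000; ++ i)
		sum += i * 1e-9;

	std::cout << "sum: " << sum << std::endl; // use the result so the loop is kept
	std::cout << "CPU seconds: " << double(std::clock() - cpu0) / CLOCKS_PER_SEC << std::endl; // summed over all threads
	std::cout << "wall seconds: " << omp_get_wtime() - wall0 << std::endl; // elapsed time
	return 0;
}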
EDIT2:
As suggested by user8046, the code is heavily memory bound, and as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. I therefore made a small change: I moved sum outside of the loop, so that it does not get zeroed at every iteration over c
, replaced the reduction operation with sum += F(vec[i])
, and replaced the memset operation with vec[i] = F(i) (roughly as in the sketch below).
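Roughly, the modified loops look like this (a sketch of the change, not the full file; F comes from the -D definition on the command line, and I assume <cmath> is included for sqrt):

double sum = 0; // now accumulates across all passes over c
for(int c = 0; c < COUNT; ++ c) {
	#pragma omp parallel for default(none) reduction(+:sum)
	for(int i = 0; i < sz_i; ++ i)
		sum += F(ptr[i]); // reduction now does real per-element work
}

for(int c = 0; c < COUNT; ++ c) {
	#pragma omp parallel for default(none)
	for(int i = 0; i < sz_i; ++ i)
		ptr[i] = F(i); // "memset" replaced by per-element computation
}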
Running as:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
elapsed clock ticks: 23.9106
parallel reduction
elapsed clock ticks: 3.35519
serial memset
elapsed clock ticks: 43.7344
parallel memset
elapsed clock ticks: 6.50351
Calculating the square root adds more work for the threads, and there is finally a reasonable speedup: about 7x
, which makes sense, as the hyperthreaded cores share memory lanes.
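As a rough sanity check on the memory-bound explanation (my own back-of-the-envelope arithmetic, using the numbers above): the vector holds 250000 * 200 = 50 million ints, i.e. about 200 MB per pass and roughly 20 GB over the 100 passes, so the ~2.4 s serial memset corresponds to on the order of 8 GB/s of stores. The parallel version at ~2.2 s shows the extra threads buy essentially nothing there, which is consistent with the plain loops being limited by memory bandwidth rather than by compute.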