I have extracted this simple member function from a larger 2D program. All it does is run a for loop over three different arrays and do a math operation (a 1D convolution). I have been testing OpenMP to make this particular function faster:
void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
#pragma omp parallel for
  for ( int x = 0; x < width; ++x )
  {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
  }
}
If I use gcc 4.7 on Debian/wheezy amd64, the overall program runs a lot slower on an 8-CPU machine. If I use gcc 4.9 on Debian/jessie amd64 (only 4 CPUs on this machine), the overall program shows very little difference.
Using time to compare:
single core run:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi core run:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
$ head -3 black.pgm
P5
65536 65536
255
So Width is set to 65536 during execution.
If it matters, I am using cmake for compilation:
add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
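(As an aside, with a newer CMake than ships on wheezy/jessie — 3.9 or later, if I am not mistaken — the same thing can be expressed through the imported OpenMP target instead of raw flags; this is just an alternative I have seen, not what I am currently using:)

```cmake
find_package(OpenMP REQUIRED)
add_executable(test test.cxx)
# The imported target carries both the compile and link flags for OpenMP.
target_link_libraries(test PRIVATE OpenMP::OpenMP_CXX)
```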
And CMAKE_BUILD_TYPE is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG
My question: why is this for loop not faster using multiple cores? There is no overlap between the arrays, and OpenMP should split the iterations equally among the threads. I do not see where the bottleneck is coming from.
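To take the file I/O out of the timings above, one thing that can be done is to time just the loop itself. This is a sketch of such a harness (the helper name timed_convolve is mine; I use std::chrono here so it builds even without -fopenmp, though with OpenMP enabled omp_get_wtime() would work too):

```cpp
#include <chrono>
#include <vector>

// Times the {1, 2, 1} convolution loop alone, excluding any file I/O.
// Writes the result into out and returns the elapsed wall time in seconds.
double timed_convolve(const std::vector<int> &p0, const std::vector<int> &p1,
                      const std::vector<int> &p2, std::vector<int> &out)
{
    const int width = static_cast<int>(out.size());
    const auto t0 = std::chrono::steady_clock::now();
#pragma omp parallel for
    for (int x = 0; x < width; ++x)
        out[x] = 1 * p0[x] + 2 * p1[x] + 1 * p2[x];
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Comparing the returned time with OMP_NUM_THREADS=1 against the default thread count would show whether the loop itself scales, independent of reading and writing the PGM files.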
EDIT: as suggested in the comments, I changed my input file to:
$ head -3 black2.pgm
P5
33554432 128
255
So Width is now set to 33554432 during execution (which should be considered big enough). Now the timing reveals:
single core run:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi core run (for some reason cpu% was always below 100%, which would suggest no threading at all):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total