
I have extracted this simple member function from a larger 2D program. All it does is run a for loop that reads from three different arrays and does a math operation (1D convolution). I have been experimenting with OpenMP to make this particular function faster:

void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
#pragma omp parallel for
  for ( int x = 0; x < width; ++x )
    {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
    }
}

If I use gcc 4.7 on debian/wheezy amd64, the overall program runs a lot slower on an 8-CPU machine. If I use gcc 4.9 on debian/jessie amd64 (only 4 CPUs on this machine), the overall program shows very little difference.

Using `time` to compare, a single-core run:

$ ./test black.pgm out.pgm  94.28s user 6.20s system 84% cpu 1:58.56 total

and a multi-core run:

$ ./test black.pgm out.pgm  400.49s user 6.73s system 344% cpu 1:58.31 total

Where:

$ head -3 black.pgm
P5
65536 65536
255

So Width is set to 65536 during execution.

In case it matters, I am using CMake for the build:

add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")

And CMAKE_BUILD_TYPE is set to:

CMAKE_BUILD_TYPE:STRING=Release

which implies -O3 -DNDEBUG

My question: why is this for loop not faster with multiple cores? There is no overlap between the arrays, and OpenMP should split the iterations equally across the threads. I do not see where the bottleneck is coming from.

EDIT: as suggested in the comments, I changed my input file to:

$ head -3 black2.pgm
P5
33554432 128
255

So Width is now set to 33554432 during execution (which should be considered big enough). Now the timing reveals:

single-core run:

$ ./test ./black2.pgm out.pgm  100.55s user 5.77s system 83% cpu 2:06.86 total

multi-core run (for some reason the CPU usage was always below 100%, which would suggest no extra threads running at all):

$ ./test ./black2.pgm out.pgm  117.94s user 7.94s system 98% cpu 2:07.63 total
  • In general, false sharing/lock contention. Also, how big is `width`? – sehe Jan 12 '15 at 13:44
  • @sehe sorry forgot to mention that. – malat Jan 12 '15 at 13:46
  • How did you test that? I doubt the single 64k loop you gave takes that much time. – ElderBug Jan 12 '15 at 14:46
  • If you are calling this function 65536 times, the overhead of starting a new parallel region each time will be huge and could offset the savings from running the kernel in parallel. Besides, your code might suffer from the problem described [here](http://stackoverflow.com/questions/13636464/slow-sparse-matrix-vector-product-csr-using-open-mp). Check the memory utilisation with a performance tool such as LIKWID. – Hristo Iliev Jan 12 '15 at 14:51
  • 1
    @HristoIliev About the link, the code here doesn't suffer from this problem. Here the access is linear, not like a sparse matrix, so it should be efficiently cached. – ElderBug Jan 12 '15 at 14:57
  • 1
    @ElderBug, the arrays are used in a streaming fashion. Nothing from the code shown suggests that they are reused. – Hristo Iliev Jan 12 '15 at 14:59
  • @HristoIliev What I am saying is that there is only 3 arrays, all accessed sequentially (0,1,2,3...), so it should be efficiently cached (caching next line in RAM is easy and fast). The problem is more likely what you said first, because 64k is way too low to split, so the overhead is huge. – ElderBug Jan 12 '15 at 15:13
  • @ElderBug, the image is 65536x65536 8-bit grayscale. That is at least 4 GiB of image data. The arrays are of type `int` which inflates the storage further to 16 GiB. No matter how they are accessed, if the data in them is not reused (and nothing shown suggests that it is), the data has to be brought in from the main memory, which limits the ability of the code to run in parallel. Reading the performance counters with a tool like LIKWID would prove or disprove that hypothesis. – Hristo Iliev Jan 12 '15 at 15:46
  • @HristoIliev could you please document LIKWID usage? The wiki page https://code.google.com/p/likwid/wiki/LikwidPerfCtr is not very clear to me (I am very new to the tool). – malat Jan 12 '15 at 15:51
  • @HristoIliev What you say is true, but I disagree about whether it affects parallel performance. The main problem is usually cacheability. RAM is freaking fast, but not so much if access is random. Typical PC will have 10GiB/s or 20GiB/s in RAM bandwidth. 16GiB would take ~1s, waaaay less than the 120s measured (anyway, only really simple processing can go faster than RAM). The problem could be that memory is not accessed sequentially, but here it is (more or less). I would expect at least 50% RAM bandwidth on sequential access on 3 arrays. But of course this is also a guess. – ElderBug Jan 12 '15 at 16:02
  • `likwid-perfctr -g MEM -C 0:3 ./program` on the 4-core system or `-C 0:7` on the 8-core one. Note that `-C` also enables thread binding, i.e. OpenMP threads won't get migrated by the OS scheduler from one core to another, which might significantly improve the performance. The output from the command should be pretty much self-explanatory. You might also try with `-c` instead of `-C` and compare the results. – Hristo Iliev Jan 12 '15 at 16:09
  • The following error appears on both machines (i5 & i7): ERROR - [./src/perfmon.c:993] Uncore not supported on Desktop processors! – malat Jan 12 '15 at 16:19
  • Tough luck, I guess your CPU is (still) not supported by LIKWID. You might still want to profile your application. First of all, try running it with `OMP_PROCBIND=true` and also try to time only the part where the work is being done, e.g. using `omp_get_wtime()` before and after that part. – Hristo Iliev Jan 12 '15 at 16:27
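
Following the suggestion in the last comment, here is a minimal sketch of timing only the convolution work with `omp_get_wtime()` instead of timing the whole program (which also reads and writes multi-gigabyte PGM files), and of printing how many threads OpenMP will use. The call site and the `image` variable are assumptions, not part of the original program:

#include <omp.h>
#include <cstdio>

// ... inside main(), around the convolution work only (hypothetical call site):
std::printf("max OpenMP threads: %d\n", omp_get_max_threads());

double t0 = omp_get_wtime();
image.convolve_lines();            // or the loop that calls it for every row
double t1 = omp_get_wtime();

std::printf("convolution time: %.3f s\n", t1 - t0);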

1 Answer


I have some general comments:

1. Before optimizing your code, make sure the data is 16-byte aligned (32-byte if you plan to use AVX). This is extremely important for whatever optimization you want to apply. If the data is separated into 3 pieces, it is better to add some dummy elements so that the starting addresses of the 3 pieces are all aligned. That way the CPU can load your data into cache lines easily.

2. Make sure the simple function is vectorized before adding OpenMP. In most cases, using the AVX/SSE instruction sets should give you a decent 2x to 8x single-thread improvement. And it is very simple in your case: create a constant `__m256i` register holding the value 2, and load 8 integers at a time into three `__m256i` registers. On a Haswell processor, one addition and one multiplication can be executed together, so theoretically the loop should speed up by a factor of 12 if the AVX pipeline can be kept filled! (A sketch covering points 1 and 2 follows this list.)

3. Sometimes parallelization degrades performance. A modern CPU needs several hundred to a few thousand clock cycles to warm up, i.e. to enter its high-performance states and scale up its frequency. If the task is not large enough, it is very likely finished before the CPU has warmed up, and going parallel gains nothing. And don't forget that OpenMP has overhead of its own: thread creation, synchronization and destruction. Another cause is poor memory access: if data accesses are too scattered, all the CPU cores sit idle waiting for data from RAM. (A second sketch after this list shows one way to reduce the per-call OpenMP overhead.)
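
As an illustration of points 1 and 2, here is a minimal sketch assuming an AVX2-capable CPU (Haswell or later) and compilation with -mavx2. The helper names are mine, not the author's; the aligned loads/stores require that all four buffers start on a 32-byte boundary (use _mm256_loadu_si256 / _mm256_storeu_si256 otherwise):

#include <immintrin.h>   // AVX2 integer intrinsics
#include <stdlib.h>      // posix_memalign (available on the Debian systems above)

// Allocate n ints starting on a 32-byte boundary (hypothetical helper).
int *alloc_aligned_ints(size_t n)
{
  void *p = nullptr;
  if (posix_memalign(&p, 32, n * sizeof(int)) != 0)
    return nullptr;
  return static_cast<int *>(p);
}

// Vectorized 1-2-1 kernel: processes 8 ints per iteration.
void convolve_line_avx2(const int *ptr0, const int *ptr1, const int *ptr2,
                        int *output, int width)
{
  int x = 0;
  for ( ; x + 8 <= width; x += 8 )
    {
    __m256i a = _mm256_load_si256(reinterpret_cast<const __m256i *>(ptr0 + x));
    __m256i b = _mm256_load_si256(reinterpret_cast<const __m256i *>(ptr1 + x));
    __m256i c = _mm256_load_si256(reinterpret_cast<const __m256i *>(ptr2 + x));
    b = _mm256_slli_epi32(b, 1);                      // 2 * ptr1[x]
    const __m256i sum = _mm256_add_epi32(_mm256_add_epi32(a, b), c);
    _mm256_store_si256(reinterpret_cast<__m256i *>(output + x), sum);
    }
  for ( ; x < width; ++x )                            // scalar tail
    output[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
}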

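For point 3 specifically: if the full program calls convolve_lines() once per image row, it creates and tears down a parallel region 65536 times, and each region only covers a short, memory-bound 1-2-1 sum. One common remedy is to parallelize one level higher, over the rows, so the thread-management cost is paid once for the whole image. The sketch below is hypothetical: the rows/out_rows arrays of row pointers and the Height member are assumptions about the data layout, not the author's code:

// Hypothetical coarser-grained version: one parallel region for the whole image.
void Image::convolve_all_lines()
{
#pragma omp parallel for
  for ( int y = 1; y + 1 < Height; ++y )
    {
    const int *ptr0 = rows[y - 1];   // row above
    const int *ptr1 = rows[y];       // current row
    const int *ptr2 = rows[y + 1];   // row below
    int *out = out_rows[y];
    for ( int x = 0; x < Width; ++x )
      out[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
    }
}
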
My Suggestion:

You might want to try Intel MKL rather than reinventing the wheel. The library is optimized to the extreme and no clock cycles are wasted. You can link against either the serial library or the parallel version; if going parallel can help at all, a speed boost is guaranteed.
