Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU

Question

I've tried using OpenMP with a single #pragma omp parallel for, and it resulted in my programme going from a runtime of 35s (99.6% CPU) to 14s (500% CPU), running on Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz. That's the difference between compiling with g++ -O3 and g++ -O3 -fopenmp, both with gcc (Debian 4.7.2-5) 4.7.2 on Debian 7 (wheezy).

Why is it only using 500% CPU at most, when the theoretical maximum would be 800%, since the CPU is 4 core / 8 threads? Shouldn't it be reaching at least low 700s?
Why am I only getting a 2.5x improvement in overall time, yet at a cost of 5x in CPU use? Cache thrashing?

The whole programme is based on C++ string manipulation, with recursive processing (using a lot of .substr(1) and some concatenation), where said strings are continuously inserted into a vector of set.

In other words, basically, there are about 2k loop iterations done in a single parallel for loop, operating on vector, and each one of them may do two recursive calls to itself w/ some string .substr(1) and + char concatenation, and then the recursion terminates with set .insert of either a single string or a concatenation of two strings, and the said set .insert also takes care of a significant number of duplicates that are possible.

Everything runs correctly and well within the spec, but I'm trying to see if it can run faster. :-)

Can you provide an [mcve]? There are many important bits missing from your description, most importantly the size of the working set. Is the `set` shared between threads? Have you confirmed your `OMP_NUM_THREADS` environment variable setting? — Zulan, Apr 30 '16 at 07:48
@Zulan, no, each iteration of the `for` loop has its own `set` (otherwise, I doubt it'll even work, since it wouldn't be thread-safe), which are all part of a `vector` structure. there is no intentional sharing of anything between the threads; certainly no sharing of any mutable objects. — cnst, Apr 30 '16 at 16:53
ok, funny story -- i changed my recursive function to have all the string input as `const`, and speed went up from 35s to 27s uniprocessor (finally matching my golang and erlang implementations), and 12s on mp; however, that's still only a 2.5x improvement at 5x resources. — cnst, Apr 30 '16 at 17:07
well, i think you guys are right to dismiss hyperthreading -- i did `schedule(static) num_threads(4)` at the end of that pragma, and i'm now reliably getting below 12s and at only 320% (e.g., actually a second faster than the prior `const` optimisation, which could run as fast as 11.3s or as slow as 13s, depending on its mood); is there a way to automatically detect hyperthreading, and spin up only half the threads? for what it's worth, i think the whole contention resolves around automatic memory allocation. — cnst, Apr 30 '16 at 17:53
You can certainly [detect Hyper Threading](http://stackoverflow.com/q/2901694/620382), but it is more reliable to control the threading with a system and compiler specific configuration. Often you also want to consider affinity settings. — Zulan, Apr 30 '16 at 18:21
Wow, so, I now tried `schedule(dynamic) num_threads(5)`, and i'm reliably getting 9.8s 474%! That's 27/9.8 = 2.75x speedup, at 4.7x CPU! The speed goes above 10s with a higher number of threads, and is 10.0s at 4 threads. Anyhow, there's a followup at http://stackoverflow.com/questions/36958661/openmp-dont-use-hyperthreading-cores-half-num-threads-w-hyperthreading. — cnst, Apr 30 '16 at 18:45

score 3 · Answer 1 · answered Apr 30 '16 at 08:16

Based on your description you can draw the following conclusions:

^{I am assuming OpenMP really uses 8 threads (verify by export OMP_NUM_THREADS=8)}

500% CPU means, that there is significant time spent in barriers. This is due to bad load balance: different iterations taking varying amount of time. Hence the default (static) loop scheduling is inefficient, try different kinds of loop scheduling, e.g. dynamic.
If the runtime time does not decrease proportionally to the number of threads (e.g. the overall CPU time spent increases), there is either a shared resource that acts as a bottleneck, or an influence between the threads. Note this can also be the result of short barriers (from load imbalances), that perform busy waiting instead of blocking.
- The most common shared resource is the memory bandwidth. Whether that affects you, depends on whether your working set fits into local caches. Considering the many memory hierarchies and NUMA properties in modern systems, this can become very complex to understand. The is nothing fundamental you can do against that short of restructuring your data accesses to use caches more efficiently (blocking).
- False sharing: If multiple threads write & read to the same memory locations (cache lines), the respective memory accesses become much slower. Try to restrict writes within the parallel loop to private variables.
- HyperThreading - a core being a shared resource between two hardware threads.
  
  using 5x resources
  
  This is a fundamental misunderstanding. The additional hardware thread provides little additional resources to each core. If two threads run on the same core, they will share computational resources and memory bandwidth, basically the only advantage is hidden I/O latency. You will never see 2x speedup, despite 2x CPU time. Sometimes it will give you 10% speedup, sometimes it will be slower.
  
  Now consider 5 active threads running on 4 cores. 2 threads, sharing a core, will run only at ~50% speed, slowing everything down. Try reducing the number of threads to the number of cores (export OMP_NUM_THREADS=4).

interestingly, i've tried `schedule (dynamic)`, and I am getting 760% CPU, but the time the programme runs is still the same (actually, half a second slower), so, it's even worse -- 2.5x speedup at 8x resource use. — cnst, Apr 30 '16 at 17:28

simpel01 · Answer 2 · 2016-04-30T04:55:12.343

Achieving linear speedup is usually not trivial, and it gets harder the more cores you have. There might be several aspects you may need to look for:

False sharing: if the array is not sliced properly, a cache line may be in contention between two cores, usually the two halves of the cache line are written by two threads causing the cache line to move from the L2 cache of one core to the other. Cache sharing can also happen if you are using lots of shared variables. In that case consider using private or firstprivate to avoid synchronization.
OpenMP scheduling: if not specified, OpenMP will split the iterations of the loop in equal manner among the threads in the system and assign a subrange to each of them. However, if the amount of work for each index vary then you may end-up in a situation where most of the threads are finish their work and they are blocked at the barrier at the end of the parallel region waiting for the slower thread (the one that got more work to do). This depends on the type of algorithm you have implemented inside your loop, but OpenMP gives you the opportunity to change scheduling policy using the schedule() directive. Consider trying dynamic at least.

I think you misread my question -- i'm only seeing a 2.5x speedup (using 5x resources), which is certainly better than nothing, but after the initial euphoria, does seem to leave quite some room for further improvement. — cnst, Apr 30 '16 at 04:51
I would try increase the size of the loop and see if your speedup improves. If not, it might be your program is IO-bound which means it is mostly waiting fetching stuff from memory and it does not have a good cache behavior. — simpel01, Apr 30 '16 at 04:57

Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU

2 Answers2

Linked