
I'm trying to understand limits to parallelization on a 48-core system (4× AMD Opteron 6348, 2.8 GHz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel):

// Compile with: gcc scaling.c -std=c99 -fopenmp -O3                                                                                               

#include <stdio.h>
#include <stdint.h>

int main(){

  const uint64_t umin=1;
  const uint64_t umax=10000000000LL;
  double sum=0.;
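  // Sum 1/u^2 over [umin, umax); the full series converges to pi^2/6.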
#pragma omp parallel for reduction(+:sum)
  for(uint64_t u=umin; u<umax; u++)
    sum+=1./u/u;
  printf("%e\n", sum);

}

I was surprised to find that the scaling is highly nonlinear. Approximate run times:

48 threads: 2.9 s
36 threads: 3.1 s
24 threads: 3.7 s
12 threads: 4.9 s
 1 thread:  57 s

Unfortunately, there is one other process on the machine using 100% of one core, which might be affecting the results. It's not my process, so I can't kill it to test the difference, but I doubt it accounts for the gap between a 19–20x speedup and the ideal 48x speedup.

To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program.

What's preventing linear scaling with this simple program?

Douglas B. Staple
  • The problem may be because all threads are accessing the variable `sum` at the same time. OT, but `sum+=1./(u*u);` will be faster and may parallelize better, since the CPU may have multiple multipliers available and doesn't have to deal with slow division. – phuclv Nov 05 '13 at 01:59
  • @LưuVĩnhPhúc The different threads aren't accessing `sum` at the same time because of the reduction. Also, in this case `sum+=1./(u*u);` is actually slower, because you have to use 128-bit integers for the multiply, while the two divides can both be 64-bit. Anyway, it's just a simple test code; the difference between multiplies and divides still doesn't explain the poor scaling! – Douglas B. Staple Nov 05 '13 at 02:06
  • One observation: you have quite good scaling on a single core (4.9 s for 12 threads ≈ 57 s / 12 = 4.8 s) and bad scaling across multiple cores. I don't know your system, so I have no further comments. Another suggestion: did you try checking the actual number of threads and using the OpenMP functions for time measurements (omp_get_num_threads, omp_get_wtime)? It may be that the actual number of threads was lower. – SergV Nov 05 '13 at 03:22
  • A typo in my comment above: it should read "you have quite good scaling on a single **CPU** and bad scaling across multiple **CPUs**". – SergV Nov 05 '13 at 04:48
  • AMD's modern use of the term "core" is misleading. It's similar to Intel's hyperthreads. You really only have 6 modules per Opteron 6348 (24 in your system). Each module has what AMD calls two cores, but they share a floating point unit. You can't really expect linear scaling, just like you can't expect linear scaling with hyper-threading. In your case it should be more linear (depending on what you're doing) up to 24 threads and then be much less linear after that. – Z boson Nov 05 '13 at 07:49
  • Just to be clear: the [Opteron 6348](http://en.wikipedia.org/wiki/Opteron) has two dies, each with three modules, each with two "cores", so 2*3*2 = 12. – Z boson Nov 05 '13 at 08:25
  • The proper way to benchmark the scalability of this code (and any code in general) is to run on an empty system (i.e. no 100% CPU processes) AND with thread binding/pinning enabled in order to prevent threads from being migrated around. Also, the parallel region has to be called twice and only the second entry timed, otherwise you are also measuring the overhead from creating the thread pool (see the timing sketch after these comments). – Hristo Iliev Nov 05 '13 at 16:29
  • Please separate your findings from the question. You are actually allowed to post your own answer and even to accept it if you find the others unsatisfactory. – Hristo Iliev Nov 09 '13 at 02:04
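
A minimal sketch of the kind of measurement suggested in the comments above: a warm-up parallel region so the thread-pool creation isn't timed, `omp_get_num_threads` to record the thread count actually obtained, and `omp_get_wtime` around the timed region. This is illustrative only, not the exact benchmark used for the numbers in the question:

// Compile with: gcc timing.c -std=c99 -fopenmp -O3
#include <stdio.h>
#include <stdint.h>
#include <omp.h>

int main(){

  int nthreads = 0;

  // Warm-up region: creates the thread pool (so its overhead isn't timed)
  // and records how many threads the runtime actually gives us.
#pragma omp parallel
  {
#pragma omp single
    nthreads = omp_get_num_threads();
  }

  const uint64_t umin=1;
  const uint64_t umax=10000000000LL;
  double sum=0.;

  double t0 = omp_get_wtime();
#pragma omp parallel for reduction(+:sum)
  for(uint64_t u=umin; u<umax; u++)
    sum+=1./u/u;
  double t1 = omp_get_wtime();

  printf("%d threads: %.3f s, sum = %e\n", nthreads, t1-t0, sum);

}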

3 Answers


I'm not sure this qualifies as an answer but it feels like more than a comment, so here we go.

I've never noticed particularly linear performance against the number of threads in any of my projects. For one thing, there's the scheduler, which is anything but rigorously fair, it seems to me. OpenMP probably divides the task evenly among its team of threads at the outset, then joins each. On every Linux box I've had the pleasure of, I would expect a few threads to finish early and a few threads to lag; other platforms will vary. However that works out, of course you're waiting for the slowest to catch up. So stochastically speaking, there's a pulse of thread completions going by in something of a bell curve (the more threads, the wider, I should think), and you're never done until the trailing edge crosses the finish line.

What does `top` say? Does it tell you your process gets 2000% CPU at 20 threads, 4000% at 40? I bet it tapers off. `htop`, by the way, typically shows a process total and separate lines for each thread. That might be interesting to watch.

With a tiny loop like that, you're probably not running into cache thrash or any such annoyance. But another issue that's bound to shave some performance off the top: like any modern multi-core CPU the Opteron runs at a higher clock rate when it's cool. The more cores you heat up, the less turbo mode you'll see.

Salt
  • I should add, with as many cores as threads one would expect scheduling to have a minimal impact. Thing is I'm not sure how minimal is minimal. Most interesting multi-threaded programs in my experience have more threads than cores. Still, I would expect some sloppiness. It might be interesting to run an experiment with explicit processor affinities. – Salt Nov 05 '13 at 05:05
  • The GNU implementation of OpenMP puts a docking barrier at the beginning of each parallel region, therefore all threads in the team start more or less synchronised. The default scheduling in GOMP is `static`, i.e. the work is evenly distributed and each thread executes iterations from a contiguous subset. +1 for the remark about core frequency boosting. – Hristo Iliev Nov 05 '13 at 09:08
  • @DouglasB.Staple, your code uses static scheduling and therefore load imbalance due to OS jitter matters heavily. Keep in mind that all threads in the team are synced by the implicit barrier at the end of each parallel region, therefore the slowest thread determines the total execution time of the region. The fact that one core is kept busy means that cores timeshare and perform constant context switches (and possibly thread migration too). – Hristo Iliev Nov 05 '13 at 16:24
  • @HristoIliev You are right, dynamic scheduling speeds up this little test code significantly. I'm going to have to retest when I get an opportunity with an unloaded system. In the real code I'm developing I basically have to use dynamic scheduling, but there the issue could be the overhead from the scheduling, or any of the other answers here. – Douglas B. Staple Nov 05 '13 at 16:49
  • @HristoIliev, those are interesting points you make. I don't have experience with multi-socket systems. But as I understand it, it's only one core on one of the processors that's kept busy, so the other three processors should not be affected much? – Z boson Nov 06 '13 at 07:54
  • @redrum, process schedulers are quite dumb. The default is to try and keep all logical CPUs equally busy. Without binding, some of the threads are going to timeshare with that other process, and context switches between processes are very expensive because the TLB is flushed. This issue does not depend on the number of sockets - having more than one just makes it worse since inter-socket migration is even more expensive. – Hristo Iliev Nov 06 '13 at 09:45
  • @DouglasB.Staple, I'm confused by the fact that you see better results with dynamic scheduling. Did you explicitly specify the chunk size, e.g. `schedule(dynamic,100000)`? If not, the default is a chunk size of one iteration and the overhead should be HUGE. – Hristo Iliev Nov 06 '13 at 09:48
  • @HristoIliev Yes, I used `schedule(dynamic,1000000)` to get the best performance. Unfortunately the system is still loaded on one core and I have no idea when I'll be able to do precise timing. – Douglas B. Staple Nov 06 '13 at 13:06

I have two important points as to why your results are not linear. The first is about Intel Hyper-Threading and AMD modules. The second is about turbo frequency modes on Intel and AMD.

1.) Hyper-threading and AMD modules/cores

Too many people confuse Intel Hyper-Threading and AMD cores within modules with real cores and expect a linear speed-up. An Intel processor with Hyper-Threading can run twice as many hardware threads (hyper-threads) as it has cores. AMD has its own technology, where the fundamental unit is called a module and each module contains what AMD disingenuously calls a core (see "What's a module, what's a core"). One reason this is easily confused is that, for example, the Task Manager in Windows on a system with Hyper-Threading shows the number of hardware threads but labels them CPUs. This is misleading, as it's not the number of CPU cores.

I don't have enough knowledge of AMD to go into details, but as far as I understand each module has one floating-point unit (but two integer units). Therefore, you can't really expect a linear speed-up beyond the number of Intel cores or AMD modules for floating-point operations.

In your case the Opteron 6348 has 2 dies per processor, each with 3 modules, each of which has 2 "cores". Though this gives 12 cores per processor, there are really only 6 floating-point units.
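
If you want to see how the OpenMP threads actually land on logical CPUs (and, from there, work out how many distinct modules are in use), something like the following can help. This is only a sketch: it assumes Linux/glibc, where `sched_getcpu()` is available, and the mapping from logical CPU numbers to modules is system-specific (check `/proc/cpuinfo` or a tool like `lstopo`):

// Compile with: gcc whereami.c -std=c99 -fopenmp
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   // sched_getcpu() - Linux/glibc specific
#include <omp.h>

int main(){

#pragma omp parallel
  {
    // Each thread reports the logical CPU it is currently running on.
    printf("thread %2d of %2d on logical CPU %2d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }

}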

I ran your code on my single socket Intel Xeon E5-1620 @ 3.6 GHz. This has 4 cores and hyper-threading (so eight hardware threads). I get:

1 threads: 156s 
4 threads: 37s  (156/4 = 39s)
8 threads: 30s  (156/8 = 19.5s)

Notice that for 4 threads the scaling is almost linear, but for 8 threads the hyper-threading only helps a little (at least it helps). Another strange observation is that my single-threaded result is much slower than yours (MSVC 2013, 64-bit release mode). I would expect a faster single-threaded Ivy Bridge core to easily trump a slower AMD Piledriver core. This does not make sense to me.

2.) Intel Turbo Boost and AMD Turbo Core.

Intel has a technology called Turbo Boost which changes the clock frequency based on the number of threads that are running. When all threads are running, Turbo Boost is at its lowest value. On Linux, the only application I know of that can measure this while an operation is running is powertop. Getting the real operating frequency is not so easy to measure (for one thing, it needs root access). On Windows you can use CPU-Z. In any case, the result is that you can't expect linear scaling when comparing a single-threaded run to a run using the maximum number of real cores, because the single thread runs at a higher clock frequency.
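
For what it's worth, on Linux one rough way to peek at the clock frequency from a program is the cpufreq sysfs interface, if it is present. This is only a sketch, and note that on many systems `scaling_cur_freq` reflects the frequency requested by the governor rather than the actual turbo frequency, so treat the value with caution:

// Compile with: gcc readfreq.c -std=c99
#include <stdio.h>

int main(){
  // cpufreq reports the frequency of cpu0 in kHz (if the interface exists).
  FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
  if(!f){ perror("cpufreq sysfs not available"); return 1; }
  long khz;
  if(fscanf(f, "%ld", &khz) == 1)
    printf("cpu0: %.3f GHz\n", khz/1e6);
  fclose(f);

}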

Once again, I have little experience with AMD processors but as far as I can tell their technology is called Turbo Core and I expect the effect to be similar. This is the reason that a good benchmark disables turbo frequency modes (in the BIOS if you can) when comparing threaded code.

Z boson
  • Both of your comments are correct and useful, but HristoIliev's comments to the other answer make me question the limiting factor in this specific situation. I'm going to have to take a look at it again tonight when I have more time. – Douglas B. Staple Nov 05 '13 at 16:52
  • Check the assembly output from MSVC (there is an option in the project settings to save the assembly files). GCC compiles the code into a very tight loop that simply converts RDX (holds `u`) into XMM2 using `CVTSI2SD`, loads a constant `1.0` from memory (with RIP-relative addressing) into XMM1, uses two times `DIVSD XMM1,XMM2` to perform the division, adds the result to XMM0 and loops again. The upper bound of the iteration space is held in RCX. The reduction phase is implemented using a tight mutex loop. No fancy vectorisation stuff - just plain scalar SSE arithmetic. – Hristo Iliev Nov 05 '13 at 22:39
  • @DouglasB.Staple, yes, Hristo's comments are interesting. I think my comment at least would explain why you should not expect linear scaling after 24 threads. But the fact that dynamic scheduling is better is very odd. Dynamic scheduling has a large overhead, and the job you're doing has an even work load, so static scheduling would normally be the right choice. – Z boson Nov 06 '13 at 07:56
  • With a sufficiently large chunk size the overhead for the dynamic scheduling will be greatly reduced (negligible?). I used `schedule(dynamic,1000000)`, which means only 10000 blocks. – Douglas B. Staple Nov 06 '13 at 13:15
  • I think the reason dynamic scheduling is faster here is that, with static scheduling, if there is any other process at all running on the system, and the same number of threads as 'cores', there is going to be bad imbalance between the threads. I.e. whichever thread shares a core with another process at any time is going to run way slower. This will necessitate switching threads between cores to balance the execution time, incurring the overhead Hristo is talking about. – Douglas B. Staple Nov 06 '13 at 13:17
  • You should try binding the threads. You can do this with GOMP. I have not tried it yet. Look up `GOMP_CPU_AFFINITY`. I have seen Hristo discuss how to do it with GCC on SO. – Z boson Nov 06 '13 at 13:19
  • Try this, [bind-threads-to-specific-cpu-cores-using-openmp](http://stackoverflow.com/questions/14049532/bind-threads-to-specific-cpu-cores-using-openmp) – Z boson Nov 06 '13 at 13:27
  • I tried this: `export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47"`. It didn't seem to make any difference, surprisingly. – Douglas B. Staple Nov 06 '13 at 17:52

I finally got a chance to benchmark the code with a completely unloaded system:

(plot: speedup vs. number of threads, for static and dynamic scheduling, with and without thread binding)

For the dynamic schedule I used `schedule(dynamic,1000000)`. For the static schedule I used the default (iterations divided evenly between the threads). For thread binding I used `export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47"`.
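
For reference, the dynamically scheduled variant is just the original loop with a schedule clause added (the chunk size of 1000000 is the one mentioned above):

#pragma omp parallel for schedule(dynamic,1000000) reduction(+:sum)
  for(uint64_t u=umin; u<umax; u++)
    sum+=1./u/u;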

The main reason for the highly nonlinear scaling of this code is that what AMD calls "cores" aren't actually independent cores. This was part (1) of redrum's answer. It's clearly visible in the plot above from the sudden plateau of the speedup at 24 threads, and it's especially obvious with dynamic scheduling. It's also obvious from the thread binding that I chose: it turns out the binding I wrote above is a terrible choice, because you end up with two threads in each "module".

The second biggest slowdown comes from static scheduling with a large number of threads. Inevitably there is an imbalance between the slowest and fastest threads, introducing large fluctuations in the run time when the iterations are divided into large chunks with the default static scheduling. This part of the answer came from both Hristo's comments and Salt's answer.

I don't know why the effects of "Turbo Boost" aren't more pronounced (part 2 of redrum's answer). Also, I'm not 100% certain where the last bit of the scaling is lost (presumably in overhead): we get 22x performance instead of the 24x expected from linear scaling in the number of modules. But otherwise the question is pretty well answered.

Douglas B. Staple
  • Glad to see you followed up with this. Those are interesting results. You said one core of one CPU was always at 100%. Therefore the max I would think is 23x (ignoring frequency boosting). – Z boson Nov 11 '13 at 09:00
  • @Zboson Thanks. I may yet take a closer look to find out where the remaining loss is. When I first posted the question there was a process at 100%, but a few days later (when I posted this answer) the system was more or less completely idle. – Douglas B. Staple Nov 11 '13 at 21:30