
I have an OpenMP program (thousands of lines, impossible to reproduce here) that works as follows:

It consists of worker threads along with a task queue.
A task consists of a convolution; every time a worker thread pops off a task from the work queue, it performs the required convolution and optionally pushes more convolutions onto the queue.
(There is no specific "master" thread; all workers are equal.)

When I run this program on my own machine (4-core HT non-NUMA Core i7), the running times I get are:

(#threads: running time)
 1: 5374 ms
 2: 2830 ms
 3: 2147 ms
 4: 1723 ms
 5: 1379 ms
 6: 1281 ms
 7: 1217 ms
 8: 1179 ms

This makes sense.

However, when I run it on a NUMA 48-core AMD Opteron 6168 machine, I get these running times:

 1: 9252 ms
 2: 5101 ms
 3: 3651 ms
 4: 2821 ms
 5: 2364 ms
 6: 2062 ms
 7: 1954 ms
 8: 1725 ms
 9: 1564 ms
10: 1513 ms
11: 1508 ms
12: 1796 ms  <------ why did it get worse?
13: 1718 ms
14: 1765 ms
15: 2799 ms  <------ why did it get *so much* worse?
16: 2189 ms
17: 3661 ms
18: 3967 ms
19: 4415 ms
20: 3089 ms
21: 5102 ms
22: 3761 ms
23: 5795 ms
24: 4202 ms

These results are pretty consistent; they're not an artifact of load on the machine.
So I don't understand:
What could cause the performance to drop so much after 12 cores?

I would understand if the performance saturated at some level (I could blame it on limited memory bandwidth), but I don't understand how it can drop from 1508 ms to 5795 ms by adding more threads.

How is this possible?

user541686
  • One thing I'd check first is for contention for the task queue. But lots of things can mess up a parallel program, and you'll probably need to crack open a parallel profiler like Tau or Scalasca. – Raman Shah Feb 12 '14 at 23:11
  • @RamanShah: Would contention increase the time so suddenly/dramatically though? – user541686 Feb 12 '14 at 23:22
  • Hmya, that's what the "NU" in NUMA means. Once you get past the sweet spot, you get to see the cost of the processor interconnect. Pretty important that you treat them like separate machines connected through a network. Don't let them work on the same problem. – Hans Passant Feb 12 '14 at 23:28
  • @HansPassant: I see. Unfortunately I don't see how it's possible to prevent them from working on the same problem, because the convolutions have arbitrary interdependence (think DAG) and the buffers must therefore be shared and available to all threads. I can't allocate memory on the "correct" node because as soon as I do that and a different node needs to read the data it will be on the wrong node (there is a DAG of interdependencies). What can I do here then? – user541686 Feb 13 '14 at 00:05
  • Well at this point you're getting to the world where you have to pay (in lines of code, complexity, and portability) to scale your performance. It may involve something like a master process that decomposes your DAG of dependencies to get the right data to the right parts of the address space with minimal cross-talk. – Raman Shah Feb 13 '14 at 00:48
  • Also, it's hard to guess your use case, but if your 8-thread performance on your machine (the fastest walltime so far) is not good enough, it's worth asking whether you'll be outgrowing the shared-address-space machines at your disposal...it could be more flexible to switch to a distributed-memory design, e.g. with MPI. – Raman Shah Feb 13 '14 at 00:52
  • @Mehrdad, have you looked into binding the threads to cores? As far as I understand, threads can migrate around, so it may help to bind them to certain cores/processors. With GCC you can use GOMP_CPU_AFFINITY. I think you have even more control over binding with ICC, but no control with MSVC. With AMD I think you want to bind every other thread (due to modules, which are sorta like HT), e.g. GOMP_CPU_AFFINITY=0 2 4 ..., but I'm not sure. http://stackoverflow.com/questions/19780554/what-limits-scaling-in-this-simple-openmp-program/19791583#19791583 – Z boson Feb 13 '14 at 08:16
  • @Zboson: I don't think it would help because the data are really global -- the buffers aren't allocated on a per-thread basis (since they're all interdependent). So it wouldn't really matter which thread is processing the data. – user541686 Feb 13 '14 at 08:18
  • @Mehrdad, how are the buffers allocated? On NUMA systems you want to make sure that any buffers you write to don't cross a page (usually 4096 bytes) otherwise you get something like false sharing. – Z boson Feb 13 '14 at 08:22
  • @Zboson: They're allocated normally, just `new double[...]`. Not sure what you mean by not crossing a page, there's no reason why they would be smaller than 4K... a lot of them could easily cross a page because they're bigger than a page. – user541686 Feb 13 '14 at 08:25

2 Answers


These sorts of situations can be quite hard to figure out. One key is to look at memory locality. Without seeing your code, it's impossible to say EXACTLY what is going wrong, but we can discuss some of the things that make "multithreading less good":

In all NUMA systems, when the memory is located with processor X and the code is running on processor Y (where X and Y aren't the same processor), every memory access will be bad for performance. So allocating memory on the right NUMA node will certainly help. (This may require some special code, such as setting affinity masks, or at least hinting to the OS/runtime that you want NUMA-aware allocations.) At the very least, make sure you don't simply work on one large array that is allocated by the first thread before you start lots more threads.
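
For illustration, here is a minimal sketch (not taken from the question's code) of the common "first touch" approach on Linux: a page is normally placed on the NUMA node of the thread that first writes it, so initializing the buffer with the same static partitioning that the processing loop will later use tends to put each page on the node that will actually work on it. The buffer size is made up; compile with something like g++ -O2 -fopenmp.

#include <cstdio>

int main() {
    const long n = 1L << 24;          // hypothetical buffer size (~128 MB of doubles)
    double* data = new double[n];     // allocating alone does not place the pages

    // "First touch": each thread writes the chunk it will later work on,
    // so those pages end up on that thread's NUMA node.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = 0.0;

    // The processing loop uses the same schedule, so each thread mostly
    // reads and writes pages that live on its own node.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = data[i] * 2.0 + 1.0;

    std::printf("data[0] = %f\n", data[0]);
    delete[] data;
    return 0;
}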

Another thing that can be even worse is sharing or false sharing of memory: if two or more processors use the same cache line, you get a ping-pong match between those processors, where each one in turn says "I want the memory at address A", gets hold of the memory content, updates it, and then the next processor does the same thing.
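
As a hypothetical illustration (not from the code in question), per-thread counters packed next to each other in one array share cache lines and ping-pong badly, while padding each counter to its own cache line avoids it. This assumes 64-byte cache lines and C++17 (for over-aligned new); compile with -fopenmp.

#include <omp.h>
#include <cstdio>

// One counter per cache line: alignas(64) pads the struct to 64 bytes, so no
// two threads ever write to the same line. Without it, adjacent counters
// would share a line and every increment would bounce that line between cores.
struct alignas(64) PaddedCounter {
    long value;
};

int main() {
    const int nthreads = omp_get_max_threads();
    PaddedCounter* counters = new PaddedCounter[nthreads]();

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        for (long i = 0; i < 100000000L; ++i)
            counters[tid].value++;   // each thread touches only its own cache line
    }

    long total = 0;
    for (int t = 0; t < nthreads; ++t)
        total += counters[t].value;
    std::printf("total = %ld\n", total);

    delete[] counters;
    return 0;
}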

The fact that the results get bad at exactly 12 threads seems to indicate that it has to do with "sockets" - either you are sharing data, or the data is located "on the wrong node". At 12 threads, it's likely that you start using the second socket (more), which will make these sorts of problems more apparent.

For best performance, you need memory allocated on the local node, no sharing, and no locking. Your first set of results also looks like it is not "ideal". I have some (absolutely non-sharing) code that scales exactly n times faster with n processors, until I run out of processors (unfortunately, my machine only has 4 cores, so it's not very much better, but it's still 4x better than 1 core, and if I ever got my hands on a 48- or 64-core machine, it would produce 48x or 64x better results calculating "weird numbers").

Edit:

The "Socket issue" is two things:

  1. Memory locality: Basically, memory is attached to each socket, so if the memory was allocated from the region belonging to the "other" socket, then you get extra latency for every access to that memory (a sketch of allocating on a specific node follows this list).

  2. Cache/sharing: Within a processor, there are "fast" links to share data (and often a "bottom level" shared cache, e.g. the L3 cache), which allow the cores within a socket to share data more efficiently than cores in different sockets can.
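
If you want explicit control over where a buffer lives (rather than relying on first-touch placement), libnuma on Linux can allocate directly on a given node. A minimal sketch, assuming libnuma is installed and you link with -lnuma; the node number and buffer size here are made up:

#include <numa.h>      // libnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA not available on this system\n");
        return 1;
    }

    const std::size_t bytes = 64UL * 1024 * 1024;   // hypothetical buffer size

    // Allocate the buffer on NUMA node 0; threads bound to that socket can
    // then read and write it without crossing the processor interconnect.
    double* buf = static_cast<double*>(numa_alloc_onnode(bytes, 0));
    if (buf == nullptr)
        return 1;

    // ... work on buf with threads pinned to node 0 ...

    numa_free(buf, bytes);
    return 0;
}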

All this amounts to something like servicing cars when you don't have your own toolbox, so every time you need a tool, you have to ask the colleague next to you for a screwdriver, 15mm spanner, or whatever you need. And then give the tools back when your work area gets a bit full. It's not a very efficient way of working... It would be much better if you had tools of your own (at least the most common ones - one of those special spanners that you only use once a month isn't a big issue, but your common 10, 12 and 15mm spanners and a few screwdrivers, for sure). And of course, it would get even worse if there were four mechanics all sharing the same toolbox. This is the case where you have "all memory allocated on one node" in a four-socket system.

Now imagine that you have a "box of spanners", and only one mechanic can use the box at a time, so if you need a 12mm spanner, you have to wait for the guy next to you to finish using the 15mm spanner. This is what happens with "false cache-sharing" - the processors aren't really using the same value, but because there is more than one "thing" in the cache line, they end up sharing the cache line (the box of spanners).

Mats Petersson
  • +1 thanks. Unfortunately the entire reason behind the queue is that the convolutions depend on the previous convolutions, so I can't split the problem into separate parts. There's a DAG of dependencies, and performing every convolution releases a constraint on its parent nodes, and as soon as the last constraint on a convolution is released, the parent is placed onto the work queue. Every parent depends on its child's buffer, so I have no way of allocating the buffers on the "right" NUMA node because the problem cannot be split into disjoint parts (they're all interdependent). – user541686 Feb 13 '14 at 00:20
  • I think you hit the nail right on the head with the sockets. I didn't realize this, but 12 is exactly the number of CPUs with the same `physical id` according to `/proc/cpuinfo`, so it really does seem like we started using the 2nd socket. If you wouldn't mind going into the socket issue with a little more depth I'd really appreciate it. Is that the same problem as data being on the wrong NUMA node, or is it different? – user541686 Feb 13 '14 at 01:17
  • The false sharing you refer to should mention pages as well not just cache lines. That's another kind of false sharing that only happens on NUMA systems. Additionally, you can allocate the memory on one processor and use it on another but it should be aligned to a page (4096) and be a multiple of a page (4096). It's just like on a non-NUMA system. If you allocate the memory yourself outside of a parallel block then you should align it to 64 bytes and make it a multiple of 64 bytes http://stackoverflow.com/questions/21445901/openmp-implementation-of-reduction/21452204#21452204 – Z boson Feb 13 '14 at 11:45
  • Anyone reading this should also [***take a look here***](http://unix.stackexchange.com/a/92306/6252), the diagram is worth a thousand words. – user541686 Feb 18 '14 at 08:37

I have two suggestions:

1.) On NUMA systems you want to make sure that buffers you write to are aligned to page boundaries and are also multiples of the page size. Pages are usually 4096 bytes. If a buffer shares a page with data that another processor writes, you get something like false sharing at page granularity.

http://dl.acm.org/citation.cfm?id=1295483

False sharing occurs when processors in a shared-memory parallel system make references to different data objects within the same coherence block (cache line or page), thereby inducing "unnecessary" coherence operations.

and this link https://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf

...false sharing which occurs when several independent objects which may have different access patterns are allocated to same unit of movable memory (in our case, a page of virtual memory).

So for example, if an array is 5000 bytes, you should make it 8192 bytes (2*4096). Then align it with something like

#include <xmmintrin.h>  // provides _mm_malloc/_mm_free with GCC/ICC
float* array = (float*)_mm_malloc(8192, 4096);  // two pages, starting on a page boundary
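
If _mm_malloc isn't available, a rough equivalent is posix_memalign with the size rounded up to a whole number of pages. This is an illustrative sketch (the helper name is made up), not something from the original answer:

#include <cstdlib>    // posix_memalign, free
#include <cstddef>

// Hypothetical helper: allocate at least 'bytes', rounded up to a whole number
// of pages and aligned to a page boundary. Release the result with free().
void* page_alloc(std::size_t bytes, std::size_t page_size = 4096) {
    std::size_t rounded = ((bytes + page_size - 1) / page_size) * page_size;
    void* p = nullptr;
    if (posix_memalign(&p, page_size, rounded) != 0)
        return nullptr;
    return p;
}

// Usage: a 5000-byte array becomes one page-aligned 8192-byte allocation.
// float* array = static_cast<float*>(page_alloc(5000));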

On non-NUMA systems you don't want multiple threads to write to the same cache line (usually 64 bytes); this causes false sharing. On NUMA systems you don't want multiple threads writing to the same page (usually 4096 bytes).
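
One way to keep threads on separate pages in a parallel loop (an illustrative sketch, assuming a page-aligned array of doubles) is to pick a static chunk size that covers whole pages, so work is still spread across threads but no page is written by two of them:

void scale(double* a, long n) {
    // 4096-byte pages / 8-byte doubles = 512 elements per page. With a
    // page-aligned 'a', every chunk of 512 iterations stays inside one page,
    // so no two threads write to the same page.
    #pragma omp parallel for schedule(static, 512)
    for (long i = 0; i < n; ++i)
        a[i] *= 2.0;
}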

See some of the comments here Fill histograms (array reduction) in parallel with OpenMP without using a critical section

2.) OpenMP can migrate threads to different cores/processors, so you may want to bind the threads to certain cores/processors. You can do this with ICC and GCC. With GCC I think you want to do something like GOMP_CPU_AFFINITY=0 2 4... See this link: What limits scaling in this simple OpenMP program?
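
To check whether the binding actually took effect, a small sketch (assuming Linux/glibc and GCC's libgomp) can print which CPU each OpenMP thread ends up on after you set GOMP_CPU_AFFINITY:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE        // for sched_getcpu on glibc
#endif
#include <sched.h>
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // With e.g. GOMP_CPU_AFFINITY="0 2 4 6 ..." set in the environment,
        // each thread should report the CPU it was bound to.
        std::printf("thread %d running on CPU %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}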

Z boson
  • I appreciate the answer but I already addressed both of these in the comments in response to your suggestions. Could you explain why you think they will be useful despite what I explained above? Thanks! – user541686 Feb 13 '14 at 08:31
  • I mean aligned to page boundary. Of course they may be larger than 4096 bytes so you make the arrays a multiple of 4096 bytes by allocating extra space if needed. E.g. if the array is 5000 bytes you make it 8192 bytes. See Hristo Iliev's comments in the first link. – Z boson Feb 13 '14 at 08:35
  • I still don't understand what this has to do with false sharing. False sharing is when you get two variables from the same *cache line*. A cache line is 16 bytes. 16 bytes is just 2 `double`s, and my buffers are more like 200 or 2000 `double`s... the probability of false sharing is vanishingly small. – user541686 Feb 13 '14 at 08:50
  • @Mehrdad, I did not say it was false sharing. I said it was like false sharing. If one processor has to write to a page shared with another processor, then it has the problem of the processor interconnect. On non-NUMA systems you don't want multiple threads writing to the same cache line. On NUMA systems you don't want multiple threads writing to the same page. – Z boson Feb 13 '14 at 09:29
  • @Mehrdad, actually it is false sharing in a page instead of a cache line. I updated my answer to explain it. – Z boson Feb 13 '14 at 09:57
  • @Mehrdad, I think the thread migration might be important as well. For example if a thread is writing to a page on one processor then the thread gets migrated to another processor. It would then have to use the processor interconnect to access the page on the processor it had before. Anyway, I don't have a NUMA system so I only know what I read (mostly from Hristo Iliev). I'm interested to see if you find a solution. – Z boson Feb 13 '14 at 10:12
  • I see, that makes sense, thanks for the info. I'll let you know if I find a solution. – user541686 Feb 13 '14 at 10:23
  • @Mehrdad, I need to correct something I said. It's not necessary that the array you write to is a multiple of the page size. What's most important is that it is aligned to the page size and that, e.g. in a parallel loop, no two threads write to the same page. You can do this e.g. by setting the chunk size to a multiple of the page size, e.g. with `schedule(static,4096)` – Z boson Feb 13 '14 at 16:21
  • I don't believe either of the two first papers in the link applies to modern AMD processors. They were written in 1993 and 1996 respectively. The AMD Opteron processors came out in 2002 or so, and have a different architecture than the systems described in those papers. AMD systems can definitely (efficiently) share data within a page, as long as the data is in different cache-lines. Of course, the heaps that are assigned to each processor socket will be in different pages, and depending on the OS, different functions may need to be used to allocate from "per socket memory". – Mats Petersson Feb 14 '14 at 08:36
  • @MatsPetersson, everything you need to know can be found here http://stackoverflow.com/questions/10850155/openmp-for-schedule and it's only one year old. – Z boson Feb 14 '14 at 15:32