many-core CPU's: Programming techniques to avoid disappointing scalability

Question

We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.

Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of size about 4Gb. They make occasional writes to it, but it might be 10,000 reads to each write - we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads use a lot of JVM memory each in non-shared objects and they make very little access to shared JVM objects and of that, only a small percentage of accesses involve writes.

We're programming in Java, on Linux, with NUMA enabled. We have 128Gb RAM. We have 2 Opteron CPU's (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.

We've tried replicating the read-only data to have one-per-thread, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.

With 32 threads, 'top' shows the CPU's 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk i/o. With 24 threads we get 83% CPU usage. I'm not sure how to interpret 'idle' state - does this mean 'waiting for memory controller'?

We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% of 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+useNUMA" as a java commandline flag gave us a 10% boost.

One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.

What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?

Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on memory controllers'? and (d) is there any difference in the characteristics of Opteron vs Xeon's?

possible duplicate of [Why are my Opteron cores running at only 75% capacity each? (25% CPU idle)](http://stackoverflow.com/questions/12738991/why-are-my-opteron-cores-running-at-only-75-capacity-each-25-cpu-idle) — Will Hartung, Nov 20 '12 at 01:56
Aren't there any hardware-level profilers for Java? If your code was written in C, C++, Fortran or other compiled language I would have pointed you to something like `likwid` or `PAPI` in order to get hardware counter readings. Note that unlike Xeons, with Bulldozer each pair of cores share a lot of resources - instruction decoders, L1 instruction cache and L2 cache, FP scheduler, 2 FMACs/AVX engine, etc. Then each Bulldozer processor is also a NUMA device itself - it is basically two processors in a single package with a HT link between them. — Hristo Iliev, Nov 20 '12 at 08:37

score 3 · Answer 1 · edited May 23 '17 at 12:18

I also have a 32 core Opteron machine, with 8 NUMA nodes (4x6128 processors, Mangy Cours, not Bulldozer), and I have faced similar issues.

I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.

In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine grained lock that scales much better with high core counts. In your case if your work parcels really do take 5-10s each it seems unlikely this will be significant for you.

I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.

My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!

Thanks for a good answer. Initially I was pretty confident that locks are unrelated to the problem, but more recently I've gotten frustrated at clearly erroneous profiling type information, e.g. a profiler not knowing where in a 100-line method it is. So you might be right. — Tim Cooper, Apr 12 '13 at 02:51

score 2 · Answer 2 · answered Apr 10 '13 at 21:50

I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi core computers as if they were individual machines connected by a network. In fact that's all that Hypertransport and QPI are.

I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes. It means thinking very differently, ie imagine how you would write the software if your hardware was 32 single core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.

Like everything else in computing these things go in fashions. CSP dates back to the 1970s, but the very modern and Java derived Scala is a popular embodiment of the concept. See this section on Scala concurrency on Wikipedia.

What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.

Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).

The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.

I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (eg through a single memory buffer) then it's going to be slow...

Thanks for the good answer. We're programming in Java and we seem to have very little control over placement of data, although we probably can control thread affinity. In desperation we tried replicating large amounts of read-only data in the hope that pages would migrate to (or at least be initially placed in) the NUMA node of different cores. But there was no observable improvement in speed. — Tim Cooper, Apr 12 '13 at 02:47

score 1 · Answer 3 · answered Nov 30 '12 at 13:26

I assume you have optimized your locks, and synchronization made a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.

One issue that can happen even if you have no synchronization issue, is memory bus congestion. This is very nasty and difficult to get rid of. All I can suggest is somehow make your tasks bigger and create fewer tasks. This depends highly on the nature of your problem. Ideally you want as many tasks as the number of cores/threads, but this is not easy (if possible) to achieve.

Something else that can help is to give more heap to your JVM. This will reduce the need to run Garbage Collector frequently, and speeds up a little.

does 'idle' mean 'blocked on memory controllers'

No. You don't see that in top. I mean if the CPU is waiting for memory access, it will be shown as busy. If you have idle periods, it is either waiting for a lock, or for IO.

score 1 · Answer 4 · answered Apr 07 '19 at 23:55

I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.

To reiterate, our task is embarrassingly parallel so it should scale well. However, one thread has a large amount of CPU cache it can access, but as we add more threads, the amount of CPU cache each process can access gets lower and lower (the same amount of cache divided by more processes). Some levels on some architectures are shared between cores on a die, some are even shared between dies (I think), and it may help to get "down in the weeds" with the specific machine you're using, and optimise your algorithms, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.

We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.

Perhaps there's something to the algorithms which makes it infeasible- but the high parallelization, reading large amounts of data and only writing at the end, and concerns about cache contention sound like something a GPGPU/OpenCL/HSA stack could address better. — user515655, Oct 18 '20 at 09:03

score 0 · Answer 5 · answered Apr 06 '19 at 17:58

If you haven't tried that yet: Look at hardware-level profilers like Oracle Studio has (for CentOS, Redhat, and Oracle Linux) or if you are stuck with Windows: Intel VTune. Then start looking at operations with suspiciously high clocks per instruction metrics. Suspiciously high mean a lot higher than the same code on a single-numa, single-L3-cache machine (like current Intel desktop CPUs).

many-core CPU's: Programming techniques to avoid disappointing scalability

5 Answers5

Linked