
We've just taken delivery of a powerful 32-core AMD Opteron server with 128 GB of RAM. It has two Opteron 6272 CPUs with 16 cores each. We are running a large, long-running Java task on 30 threads, and we have the NUMA optimisations for Linux and Java turned on. Our Java threads mainly use objects that are private to each thread, sometimes read memory that other threads will also be reading, and very occasionally write to or lock shared objects.
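
For reference, by "turned on" I mean roughly the following; the exact command line is illustrative rather than our real one:

    # Java side: NUMA-aware heap allocation (requires the parallel collector)
    java -XX:+UseNUMA -XX:+UseParallelGC -jar app.jar   # app.jar is a placeholder
    # Linux side: confirm the NUMA topology the kernel sees
    numactl --hardware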

We can't explain why the CPU cores are sitting roughly 25% idle. Below is a dump of "top":

top - 23:06:38 up 1 day, 23 min,  3 users,  load average: 10.84, 10.27, 9.62
Tasks: 676 total,   1 running, 675 sleeping,   0 stopped,   0 zombie
Cpu(s): 64.5%us,  1.3%sy,  0.0%ni, 32.9%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132138168k total, 131652664k used,   485504k free,    92340k buffers
Swap:  5701624k total,   230252k used,  5471372k free, 13444344k cached
...
top - 22:37:39 up 23:54,  3 users,  load average: 7.83, 8.70, 9.27
Tasks: 678 total,   1 running, 677 sleeping,   0 stopped,   0 zombie
Cpu0  : 75.8%us,  2.0%sy,  0.0%ni, 22.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 77.2%us,  1.3%sy,  0.0%ni, 21.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 77.3%us,  1.0%sy,  0.0%ni, 21.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 77.8%us,  1.0%sy,  0.0%ni, 21.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 76.9%us,  2.0%sy,  0.0%ni, 21.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 76.3%us,  2.0%sy,  0.0%ni, 21.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 12.6%us,  3.0%sy,  0.0%ni, 84.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  8.6%us,  2.0%sy,  0.0%ni, 89.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 77.0%us,  2.0%sy,  0.0%ni, 21.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 77.0%us,  2.0%sy,  0.0%ni, 21.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 77.6%us,  1.7%sy,  0.0%ni, 20.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 75.7%us,  2.0%sy,  0.0%ni, 21.4%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 76.6%us,  2.3%sy,  0.0%ni, 21.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 76.6%us,  2.3%sy,  0.0%ni, 21.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 76.2%us,  2.6%sy,  0.0%ni, 15.9%id,  5.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 76.6%us,  2.0%sy,  0.0%ni, 21.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 : 73.6%us,  2.6%sy,  0.0%ni, 23.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 : 74.5%us,  2.3%sy,  0.0%ni, 23.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 : 73.9%us,  2.3%sy,  0.0%ni, 23.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 : 72.9%us,  2.6%sy,  0.0%ni, 24.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 : 72.8%us,  2.6%sy,  0.0%ni, 24.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 : 72.7%us,  2.3%sy,  0.0%ni, 25.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 : 72.5%us,  2.6%sy,  0.0%ni, 24.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 : 73.0%us,  2.3%sy,  0.0%ni, 24.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 : 74.7%us,  2.7%sy,  0.0%ni, 22.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 : 74.5%us,  2.6%sy,  0.0%ni, 22.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 : 73.7%us,  2.0%sy,  0.0%ni, 24.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 : 74.1%us,  2.3%sy,  0.0%ni, 23.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 : 74.1%us,  2.3%sy,  0.0%ni, 23.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 : 74.0%us,  2.0%sy,  0.0%ni, 24.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 : 73.2%us,  2.3%sy,  0.0%ni, 24.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu31 : 73.1%us,  2.0%sy,  0.0%ni, 24.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132138168k total, 131711704k used,   426464k free,    88336k buffers
Swap:  5701624k total,   229572k used,  5472052k free, 13745596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13865 root      20   0  122g 112g 3.1g S 2334.3 89.6  20726:49 java
27139 jayen     20   0 15428 1728  952 S  2.6  0.0   0:04.21 top
27161 sysadmin  20   0 15428 1712  940 R  1.0  0.0   0:00.28 top
   33 root      20   0     0    0    0 S  0.3  0.0   0:06.24 ksoftirqd/7
  131 root      20   0     0    0    0 S  0.3  0.0   0:09.52 events/0
 1858 root      20   0     0    0    0 S  0.3  0.0   1:35.14 kondemand/0

A dump of the Java stack confirms that none of the threads are anywhere near the few places where locks are used, nor are they anywhere near any disk or network I/O.

I had trouble finding a clear explanation of what "top" means by "idle" versus "wait", but I get the impression that "idle" means "no more threads that need to run", which doesn't make sense in our case: we're using an Executors.newFixedThreadPool(30), there are a large number of tasks pending, and each task lasts for 10 seconds or so.
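
For concreteness, the structure is roughly this (class and method names are illustrative, not our real code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BatchRunner {
        public static void main(String[] args) {
            // 30 worker threads, as in our setup; the queue of pending
            // tasks is large and each task is CPU-bound for ~10 seconds.
            ExecutorService pool = Executors.newFixedThreadPool(30);
            for (int i = 0; i < 100000; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        compute(); // stand-in for the real ~10-second task
                    }
                });
            }
            pool.shutdown();
        }

        private static void compute() {
            long sum = 0;
            for (long i = 0; i < 5000000000L; i++) {
                sum += i; // illustrative CPU-bound busy work
            }
        }
    }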

I suspect that the explanation requires a good understanding of NUMA. Is the "idle" state what you see when a CPU is waiting for a non-local access? If not, then what is the explanation?

Tim Cooper
  • Is it possible to run a scaled-down version of your application and view it through something like the Java VisualVM? If you start it up with the application, it's lightweight enough and you can observe how long threads are running and what they're blocking on. – pickypg Oct 05 '12 at 03:38
  • I suspect you have some delaying bottleneck despite your limited use of locks. Have you tried running a CPU and memory profiler on your application to see if any blocking operations show up? – Peter Lawrey Oct 05 '12 at 07:37
  • Peter, we have been dumping the stack traces using the signal that doesn't stop the program running (SIGQUIT, i.e. kill -3), and this indicates that none of the threads are anywhere near the few places where locks are used. – Tim Cooper Oct 06 '12 at 11:28
  • 3
    @some: I don't understand why this question is considered off topic. The FAQ says it must be related to programming or software development. How can a question about getting decent performance out of a multi-core server not be about programming/software development? The solution will be something to do with application-level optimisations or operating-system level configurations, such as data duplication, fixing locks, rearranging data on cache lines. I studied the FAQ and don't know what I can do to fix the question. – Tim Cooper Oct 25 '12 at 11:20
  • 1
    @Kris : can you explain why this question is off topic? See the comment immediately above. – Tim Cooper Nov 09 '12 at 04:39

1 Answer


It could be a number of things:

  • It could be contention between threads over access to shared data. This might take the form of lock contention, or extra memory traffic due to read or write barriers, though the latter is unlikely to produce these symptoms. (A quick jstack check for lock contention is sketched just after this list.)

  • You could be leaking worker threads; e.g. they occasionally die and are not being replaced.

  • There could be a bottleneck in the executor itself; e.g. it may not be scheduling the next task quickly enough when a task finishes.

  • The bottleneck could be the garbage collector, especially if you don't have parallel collection enabled.
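
For the lock-contention case, a thread dump is the quickest check (13865 is the Java PID from your top output); threads stuck in BLOCKED state waiting on a monitor would point to lock contention:

    # take a thread dump and look for threads blocked on object monitors
    jstack 13865 | grep -B2 -A4 BLOCKED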


This page talks about Java's NUMA enhancements, and mentions the NUMA-aware GC switch. Try that. Also check out the other GC tuning advice on that page.
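
For example, something along these lines; the flags are illustrative, so first check what your JVM is actually running with (e.g. via -XX:+PrintFlagsFinal):

    # parallel young + old generation collection, NUMA-aware allocation,
    # and GC logging so you can see how much wall time GC is taking
    java -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+UseNUMA \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -jar app.jar   # app.jar is a placeholder for your application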

This question explains the process states: In linux, what do all the values in the "top" command mean?

I think that the difference between "wa" and "idle" time in the processor summary is that "wa" means that the processor has threads in "D" state; i.e. waiting for disk I/O. By contrast, a processor where all threads are waiting in "S" state would be counted as "idle". (From this perspective, a thread that is waiting on a lock would be in S state.)
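
You can check the per-thread states directly; for the Java process in your dump:

    # one line per thread: S = sleeping, D = uninterruptible (disk) wait,
    # R = running/runnable
    ps -eLo pid,tid,stat,pcpu,comm | grep 13865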

You could also try top -H, which shows the threads individually.
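
For example, restricted to the Java process above:

    top -H -p 13865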

Stephen C
  • Thanks, but none of those 4 suggestions pan out (I can elaborate if you want). That question you pointed me to about the values in "top" doesn't define "wait" or "idle", or explain which of the described states they're synonyms for. – Tim Cooper Oct 06 '12 at 11:31
  • @TimCooper - 1) Until you have determined what the problem really is, I don't see how you can eliminate all of those possibilities. 2) I didn't say it would. That link explains process states ... which you need to understand to read the following paragraph. – Stephen C Oct 06 '12 at 14:12
  • Apologies about the explanation of process states. About the 4 ideas: (1) is a possibility, although the stack trace shows nothing anywhere near the few places we use locks, so (1b) i.e. memory traffic is most likely. (2) the stack trace shows all 30 worker threads intact deep into the process; (3) we're using the standard Java fixed thread executor pool; (4) good point - I need to check about parallel garbage collection. But I would expect parallel garbage collection to use all the CPU cores to full capacity...? – Tim Cooper Oct 07 '12 at 22:53
  • About your possibility (1b), do you know what state a CPU is in while waiting for a NUMA non-local memory access? – Tim Cooper Oct 07 '12 at 22:54
  • *"But I would expect parallel garbage collection to use all the CPU cores to full capacity...?"* - Not necessarily. I can imagine reasons why that might be a bad idea. – Stephen C Oct 07 '12 at 23:40
  • 1
    *"... do you know what state a CPU is in while waiting for a NUMA non-local memory access?"* I don't **know**, but my educated guess would be "R". For it to be any other state, the OS would need to be involved in the wait logic, and that would result in a tremendous slow down. – Stephen C Oct 07 '12 at 23:46