2

Garbage collector threads appear to get starved waiting for Heap_lock. Here is a stack trace of a GC thread

+   5480 _pthread_start  (in libsystem_pthread.dylib) + 168  [0x7fff9255191a]
+     5480 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff9255199d]
+       5480 java_start(Thread*)  (in libjvm.dylib) + 246  [0x110e82cd6]
+         5480 YieldingFlexibleGangWorker::loop()  (in libjvm.dylib) + 419  [0x110fbf8fb]
+           5480 Monitor::wait(bool, long, bool)  (in libjvm.dylib) + 375  [0x110e5c859]
+             5480 Monitor::IWait(Thread*, long)  (in libjvm.dylib) + 168  [0x110e5c5f0]
+               5480 ParkCommon(ParkEvent*, long)  (in libjvm.dylib) + 42  [0x110e5bd1a]
+                 5480 os::PlatformEvent::park()  (in libjvm.dylib) + 192  [0x110e7ea72]
+                   5480 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff92552728]
+                     5480 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff98080db6]

My Java app spawns around 300 threads. They also starve. Here is a stack trace example:

+             5480 java_lang_String::create_from_str(char const*, Thread*)  (in libjvm.dylib) + 59  [0x110cf1ab9]
+               5480 java_lang_String::basic_create(int, Thread*)  (in libjvm.dylib) + 38  [0x110cef8a0]
+                 5480 InstanceKlass::allocate_instance(Thread*)  (in libjvm.dylib) + 31  [0x110cbd93f]
+                   5480 CollectedHeap::obj_allocate(KlassHandle, int, Thread*)  (in libjvm.dylib) + 38  [0x110cbff8a]
+                     5480 CollectedHeap::common_mem_allocate_noinit(KlassHandle, unsigned long, Thread*)  (in libjvm.dylib) + 140  [0x110ab2484]
+                       5480 GenCollectorPolicy::mem_allocate_work(unsigned long, bool, bool*)  (in libjvm.dylib) + 475  [0x110b9f313]
+                         5480 Monitor::lock(Thread*)  (in libjvm.dylib) + 132  [0x110e5c4e6]
+                           5480 Monitor::ILock(Thread*)  (in libjvm.dylib) + 310  [0x110e5c3dc]
+                             5480 ParkCommon(ParkEvent*, long)  (in libjvm.dylib) + 42  [0x110e5bd1a]
+                               5480 os::PlatformEvent::park()  (in libjvm.dylib) + 192  [0x110e7ea72]
+                                 5480 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff92552728]
+                                   5480 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff98080db6]

The starvation starts after a few hours of running my app.

I tried three collectors: UseG1GC, UseParallelGC, and UseConcMarkSweepGC with default settings, and still get starvation (have not tried the serial GC).

I am using Java 1.8.0_92 (build 1.8.0_92-b14, Java HotSpot(TM) 64-Bit Server VM, build 25.92-b14, mixed mode) on OSX 10.11.5.

There is enough heap space Xmx6G; I see no full GC. I increased stack for GC threads XX:VMThreadStackSize to 10MB, but get same starvation. I also used an older Java, same starvation. I do not see any fatal error log (http://www.oracle.com/technetwork/java/javase/felog-138657.html).

Please suggest how to debug this starvation.

UPDATE June 19, 2016:

I grouped hanging threads by the stack trace, which you can see

Hang in GangWorker::loop()

5347 Thread_669608
+ 5347 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff930c3351]
+   5347 _pthread_start  (in libsystem_pthread.dylib) + 168  [0x7fff930c591a]
+     5347 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff930c599d]
+       5347 java_start(Thread*)  (in libjvm.dylib) + 246  [0x112485bbe]
+         5347 GangWorker::loop()  (in libjvm.dylib) + 179  [0x1125c194f]
+           5347 Monitor::wait(bool, long, bool)  (in libjvm.dylib) + 375  [0x11245f639]
+             5347 Monitor::IWait(Thread*, long)  (in libjvm.dylib) + 168  [0x11245f3d0]
+               5347 ParkCommon(ParkEvent*, long)  (in libjvm.dylib) + 42  [0x11245eafa]
+                 5347 os::PlatformEvent::park()  (in libjvm.dylib) + 192  [0x11248195a]
+                   5347 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff930c6728]
+                     5347 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8dbbcdb6]

Hang in ConcurrentG1RefineThread::wait_for_completed_buffers()

5347 Thread_669627
+ 5347 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff930c3351]
+   5347 _pthread_start  (in libsystem_pthread.dylib) + 168  [0x7fff930c591a]
+     5347 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff930c599d]
+       5347 java_start(Thread*)  (in libjvm.dylib) + 246  [0x112485bbe]
+         5347 ConcurrentG1RefineThread::run()  (in libjvm.dylib) + 90  [0x1121c054e]
+           5347 ConcurrentG1RefineThread::wait_for_completed_buffers()  (in libjvm.dylib) + 65  [0x1121c08e5]
+             5347 Monitor::wait(bool, long, bool)  (in libjvm.dylib) + 375  [0x11245f639]
+               5347 Monitor::IWait(Thread*, long)  (in libjvm.dylib) + 168  [0x11245f3d0]
+                 5347 ParkCommon(ParkEvent*, long)  (in libjvm.dylib) + 42  [0x11245eafa]
+                   5347 os::PlatformEvent::park()  (in libjvm.dylib) + 192  [0x11248195a]
+                     5347 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff930c6728]
+                       5347 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8dbbcdb6]

Hang of compiler threads

5347 Thread_669680: Java: C2 CompilerThread0
+ 5347 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff930c3351]
+   5347 _pthread_start  (in libsystem_pthread.dylib) + 168  [0x7fff930c591a]
+     5347 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff930c599d]
+       5347 java_start(Thread*)  (in libjvm.dylib) + 246  [0x112485bbe]
+         5347 JavaThread::run()  (in libjvm.dylib) + 450  [0x112566344]
+           5347 JavaThread::thread_main_inner()  (in libjvm.dylib) + 155  [0x112564c57]
+             5347 CompileBroker::compiler_thread_loop()  (in libjvm.dylib) + 376  [0x1121b89ce]
+               5347 CompileQueue::get()  (in libjvm.dylib) + 122  [0x1121b87ac]
+                 5347 Monitor::wait(bool, long, bool)  (in libjvm.dylib) + 222  [0x11245f5a0]
+                   5347 Monitor::IWait(Thread*, long)  (in libjvm.dylib) + 168  [0x11245f3d0]
+                     5347 os::PlatformEvent::park(long)  (in libjvm.dylib) + 404  [0x112482844]
+                       5347 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff930c6728]
+                         5347 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8dbbcdb6]

In addition the remaining threads have these stack traces (one line per thread)

 _OSSpinLockLockSlow ! syscall_thread_switch _OSSpinLockLockSlow _pthread_lookup_thread pthread_kill os::SuspendedThreadTask::internal_do_task() os::SuspendedThreadTask::run() ThreadSampleClosure::sample_thread(JavaThread*,     5347 Thread_669693
 __accept BsdAttachListener::dequeue() AttachListener::dequeue() attach_listener_thread_entry(JavaThread*, JavaThread::thread_main_inner() JavaThread::run() java_start(Thread*) _pthread_body     5347 Thread_669711: Java: Attach Listener
 __psynch_cvwait _pthread_cond_wait Parker::park(bool, Unsafe_Park ??? ???       5347 Thread_669715: Java: RMI Scheduler(0)
 __psynch_cvwait _pthread_cond_wait Parker::park(bool, Unsafe_Park ??? ??? ??? ???     5347 Thread_669777: Java: log4j2.jmx.notif1
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ObjectMonitor::wait(long, ObjectSynchronizer::wait(Handle, JVM_MonitorWait ??? ???     5347 Thread_669603
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ObjectMonitor::wait(long, ObjectSynchronizer::wait(Handle, JVM_MonitorWait ??? ???     5347 Thread_669657: Java: Reference Handler
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ObjectMonitor::wait(long, ObjectSynchronizer::wait(Handle, JVM_MonitorWait ??? ???     5347 Thread_669658: Java: Finalizer
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned G1CollectedHeap::attempt_allocation(unsigned     5347 Thread_669669: Java: JFR request timer
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned G1CollectedHeap::attempt_allocation(unsigned     5347 Thread_669712: Java: RMI TCP Accept-0
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned G1CollectedHeap::attempt_allocation(unsigned     5347 Thread_669720: Java: JMX server connection timeout 29
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned G1CollectedHeap::attempt_allocation(unsigned     5347 Thread_683755: Java: RMI TCP Connection(65)-192.168.1.167
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned G1CollectedHeap::attempt_allocation(unsigned     5347 Thread_813902: Java: RMI TCP Connection(634)-192.168.1.167
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) SafepointSynchronize::begin() VMThread::loop()     5347 Thread_669656
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock_without_safepoint_check() SafepointSynchronize::block(JavaThread*) ThreadStateTransition::trans_and_fence(JavaThreadState,     5347 Thread_669676: Java: VM JFR Buffer Thread
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::IWait(Thread*, Monitor::wait(bool, ConcurrentMarkThread::sleepBeforeNextCycle() ConcurrentMarkThread::run()     5347 Thread_669645
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::IWait(Thread*, Monitor::wait(bool, ServiceThread::service_thread_entry(JavaThread*, JavaThread::thread_main_inner()     5347 Thread_669692: Java: Service Thread
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::IWait(Thread*, Monitor::wait(bool, SurrogateLockerThread::loop() JavaThread::thread_main_inner()     5347 Thread_669667: Java: Surrogate Locker Thread (Concurrent GC)
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::IWait(Thread*, Monitor::wait(bool, SuspendibleThreadSet::join() ConcurrentG1RefineThread::sample_young_list_rs_lengths()     5347 Thread_669626
 __semwait_signal pthread_join ContinueInNewThread0 ContinueInNewThread JVMInit JLI_Launch main apple_main     5347 Thread_669600
 __sigsuspend SR_handler(int, _sigtramp vframeStreamCommon::fill_from_frame() _pthread_lookup_thread pthread_kill closefd ???     5347 Thread_671069: Java: DelayClose
 kevent_qos _dispatch_mgr_invoke _dispatch_mgr_thread          5347 Thread_669601   DispatchQueue_2: com.apple.libdispatch-manager  (serial)
 mach_msg_trap mach_msg __CFRunLoopServiceMachPort __CFRunLoopRun CFRunLoopRunSpecific CreateExecutionEnvironment JLI_Launch main     5347 Thread_669599   DispatchQueue_1: com.apple.main-thread  (serial)
 semaphore_wait_trap check_pending_signals(bool) signal_thread_entry(JavaThread*, JavaThread::thread_main_inner() JavaThread::run() java_start(Thread*) _pthread_body _pthread_start     5347 Thread_669668: Java: Signal Dispatcher
 __psynch_cvwait _pthread_cond_wait os::PlatformEvent::park() ParkCommon(ParkEvent*, Monitor::ILock(Thread*) Monitor::lock(Thread*) G1CollectedHeap::attempt_allocation_slow(unsigned 5347 Thread_815183: Java: RMI TCP Connection(635)-192.168.1.167

I omitted my 300 threads, because Java is not at a safe point (jstack does not work), so the starvation cannot be because of my (user) threads. Starvation must be caused by VM threads (all of which I listed above with stack traces).

UPDATE June 24, 2016:

I did jstack -F, and my threads are blocked, some on memory operations, like this one

Thread 200451: (state = BLOCKED)
 - java.util.Arrays.copyOf(char[], int) @bci=1, line=3332 (Compiled frame)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=137 (Compiled frame)
 - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12, line=121 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=21, line=421 (Compiled frame)
 - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=136 (Compiled frame)

but jstack does not show garbage collector threads. To get this, I run the sample command on that same process, and just like before you can see starved GC threads:

+ 5229 ???  (in <unknown binary>)  [0x11064ae07]
+   5229 OptoRuntime::new_array_nozero_C(Klass*, int, JavaThread*)  (in libjvm.dylib) + 54  [0x10f8da854]
+     5229 TypeArrayKlass::allocate_common(int, bool, Thread*)  (in libjvm.dylib) + 138  [0x10f975406]
+       5229 CollectedHeap::common_mem_allocate_noinit(KlassHandle, unsigned long, Thread*)  (in libjvm.dylib) + 105  [0x10f4b2bbb]
+         5229 CollectedHeap::allocate_from_tlab_slow(KlassHandle, Thread*, unsigned long)  (in libjvm.dylib) + 283  [0x10f59e2af]
+           5229 G1CollectedHeap::allocate_new_tlab(unsigned long)  (in libjvm.dylib) + 28  [0x10f65590e]
+             5229 G1CollectedHeap::attempt_allocation(unsigned long, unsigned int*, unsigned int*)  (in libjvm.dylib) + 182  [0x10f65e42a]
+               5229 G1CollectedHeap::attempt_allocation_slow(unsigned long, unsigned char, unsigned int*, unsigned int*)  (in libjvm.dylib) + 339  [0x10f6576ab]
+                 5229 Monitor::lock(Thread*)  (in libjvm.dylib) + 132  [0x10f85f2c6]
+                   5229 Monitor::ILock(Thread*)  (in libjvm.dylib) + 310  [0x10f85f1bc]
+                     5229 ParkCommon(ParkEvent*, long)  (in libjvm.dylib) + 42  [0x10f85eafa]
+                       5229 os::PlatformEvent::park()  (in libjvm.dylib) + 192  [0x10f88195a]
+                         5229 _pthread_cond_wait  (in libsystem_pthread.dylib) + 767  [0x7fff930c6728]
+                           5229 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8dbbcdb6]

The starvation went away when I run my app with just 30 threads (instead of 300).

We need to find out which lock the __psynch_cvwait of garbage collector waits on, but jstack does not say this. Probably code inspection is the best bet.

Greg
  • 21
  • 3
  • 1
    Whenever you have a lot of threads waiting on a lock, the first question you should ask is "what holds the lock and what is it doing?" – kdgregory Jun 18 '16 at 12:19
  • Also, what does the heap look like? Are all generations full or just one? If you're approaching capacity (which should trigger a full GC), are all the objects live (in which case you may be an a GC overhead too high situation). – kdgregory Jun 18 '16 at 12:22
  • And finally, what's the state of your overall system? Are you bumping against OS memory/commit or process/thread limits? – kdgregory Jun 18 '16 at 12:26
  • Enable detailed logging with `-XX:+PrintGCDetails -XX:+PrintSafepointStatistics –XX:PrintSafepointStatisticsCount=1 -XX:+PrintJNIGCStalls` – the8472 Jun 19 '16 at 03:56
  • Regarding "who is holding the lock": I cannot get "WAITING on..." because jconsole does not work. Java is not in a safe point. I got the stack traces I reported above with a lower level command called "sample" on OSX. I do not know how to find out which lock "__psynch_cvwait" is blocked on. Do you know? Memory is plenty. The heap was 6GB, and only about 1GB was used by my app. When I used G1, "G1 Eden Space" and "G1 Old Gen" were each < 1GB, "G1 Survivor Space" was < 80MB, "Metaspace" < 40MB, "Code Cache" < 50MB, "Compressed Class Space" < 4MB. Mem and proc limits are plenty remaining – Greg Jun 19 '16 at 05:13
  • Regarding GC logging, I already log (-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -verbose:gc). The last entry was "2016-06-19T00:46:06.242-0900: 19994.234: [GC pause (G1 Evacuation Pause) (young), 0.0125470 secs] ..... [Times: user=0.06 sys=0.01, real=0.01 secs]", which is the last log entry, no more later. I will add the other flags you requested. – Greg Jun 19 '16 at 05:16
  • I suspect __psynch_cvwait is blocked on Heap_lock, as I reported in the title, which I am guessing by looking at the code https://github.com/srisatish/openjdk/blob/master/hotspot/src/share/vm/gc_implementation/g1/g1CollectedHeap.cpp – Greg Jun 19 '16 at 05:55
  • `jstack -F -m -l` may be able to dump traces even when not at a safepoint – the8472 Jun 20 '16 at 05:42
  • @the8472 I reported jstack -F above. The -m complained "Not available on Darwin", and the -l would hang. – Greg Jun 24 '16 at 06:20
  • maybe you're running into some resource limits (memory mappings, threads counts, stack size) that kills a GC thread through an error that is not caught properly, which then leaves a lock locked. – the8472 Jun 24 '16 at 11:33
  • and I think if it stalls in the middle of a GC there might be some partial GC output in the logs. if it hangs before actually starting the GC that might point to a problem with reaching a safepoint – the8472 Jun 24 '16 at 11:39

0 Answers0