
I have an application that connects to Hazelcast. Lately I noticed that requests to Hazelcast eventually become unresponsive, so I took a thread dump of the Hazelcast process. While analyzing thread dumps from the development and production environments, I found that the threads waiting for tasks in the pool are in different states in each environment.

On the production servers, most of these threads are BLOCKED (337 out of 500). In the development environment, none are blocked: of the 60 threads, half are RUNNABLE and half are WAITING.

Are those blocked threads waiting on a synchronized block that some other thread holds indefinitely? Are 500 threads too many (some analyzers gave me a warning)? Is this what makes my application unresponsive?

What could cause this state, and how can I resolve it?

Thread dumps (Production):

Thread 120713: (state = BLOCKED)
     - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
     - java.util.concurrent.ForkJoinPool.awaitWork(java.util.concurrent.ForkJoinPool$WorkQueue, int) @bci=350, line=1824 (Compiled frame)
     - java.util.concurrent.ForkJoinPool.runWorker(java.util.concurrent.ForkJoinPool$WorkQueue) @bci=44, line=1693 (Interpreted frame)
     - java.util.concurrent.ForkJoinWorkerThread.run() @bci=24, line=157 (Interpreted frame)

Thread 120743: (state = BLOCKED)
    - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
    - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
    - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2039 (Compiled frame)
    - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Compiled frame)
    - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=149, line=1074 (Compiled frame)

Thread 120743: (state = BLOCKED)
    - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
    - java.util.concurrent.locks.LockSupport.park() @bci=5, line=304 (Compiled frame)
    - com.hazelcast.internal.util.concurrent.MPSCQueue.takeAll() @bci=83, line=231 (Compiled frame)
    - com.hazelcast.internal.util.concurrent.MPSCQueue.take() @bci=12, line=153 (Compiled frame)
    - com.hazelcast.client.spi.impl.ClientResponseHandlerSupplier$ResponseThread.doRun() @bci=17, line=164 (Compiled

Thread 128753: (state = BLOCKED)
    - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
    - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
    - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=78, line=2078 (Compiled frame)
    - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take() @bci=124, line=1093 (Compiled frame)
    - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take() @bci=1, line=809 (Compiled frame)

Thread dump from the development environment:

java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c1a1bc38> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
ashish.g

2 Answers


Thread states: here is a short explanation of each Java thread state.

NEW The thread has not yet started.

RUNNABLE The thread is executing in the JVM.

BLOCKED The thread is blocked waiting for a monitor lock.

WAITING The thread is waiting indefinitely for another thread to perform a particular action.

TIMED_WAITING The thread is waiting for another thread to perform an action for up to a specified waiting time.

TERMINATED The thread has exited.
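To see these states for yourself, here is a small sketch (the class ThreadStatesDemo and its thread names are made up for illustration) that puts one helper thread on a contended monitor, one in CountDownLatch.await() and one in Thread.sleep(), then reads Thread.getState() for each. Only the first ends up BLOCKED; parking on a latch shows up as WAITING, and a timed sleep as TIMED_WAITING.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class: drives three helper threads into BLOCKED, WAITING
// and TIMED_WAITING and reads the Java-level state of each one.
class ThreadStatesDemo {
    private static final Object MONITOR = new Object();

    // Polls until `t` reaches `expected` or `timeoutMs` elapses; returns the last state seen.
    static Thread.State awaitState(Thread t, Thread.State expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (t.getState() != expected && System.nanoTime() < deadline) {
            Thread.sleep(5);
        }
        return t.getState();
    }

    static Thread.State[] observe() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        Thread blocked = new Thread(() -> { synchronized (MONITOR) { } }, "wants-monitor");
        Thread waiting = new Thread(() -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        }, "awaits-latch");
        Thread timed = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, "sleeps");
        for (Thread t : new Thread[] {blocked, waiting, timed}) {
            t.setDaemon(true);
        }

        Thread.State[] states = new Thread.State[3];
        synchronized (MONITOR) {  // hold the monitor so 'blocked' cannot enter it
            blocked.start();
            waiting.start();
            timed.start();
            states[0] = awaitState(blocked, Thread.State.BLOCKED, 2_000);
            states[1] = awaitState(waiting, Thread.State.WAITING, 2_000);
            states[2] = awaitState(timed, Thread.State.TIMED_WAITING, 2_000);
        }
        release.countDown();  // let the helper threads finish
        timed.interrupt();
        return states;
    }

    public static void main(String[] args) throws InterruptedException {
        for (Thread.State s : observe()) {
            System.out.println(s);
        }
    }
}
```

Running main should print BLOCKED, WAITING and TIMED_WAITING, one per line; the polling in awaitState only gives the helper threads time to reach those states.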

The BLOCKED state is concerning if the same threads stay in it for a long time. Whether it is a problem depends on your case: how you process data, how you create threads (and thread pools), what your critical sections are, and how all of that interacts.

A single thread dump from production is not enough. Take several dumps and compare:

  • what changes between them;
  • how long the threads stay running or waiting;
  • whether this happens under high load or after a period of high load;
  • whether your thread count grows over time, etc.
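If you can run code inside the affected JVM (or adapt the idea to a JMX client), the "compare several dumps" advice can be approximated with ThreadMXBean: take a few snapshots a short interval apart and keep only the threads that were BLOCKED in all of them. This is a sketch under that assumption; BlockedThreadSampler and persistentlyBlocked are hypothetical names, and the sample count and interval are arbitrary.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: samples the JVM's own threads several times and keeps
// only those that were BLOCKED in every sample.
class BlockedThreadSampler {
    static Set<String> persistentlyBlocked(int samples, long intervalMs) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Set<String> stillBlocked = null;
        for (int i = 0; i < samples; i++) {
            Set<String> blockedNow = new HashSet<>();
            // false, false: skip locked-monitor and synchronizer details to keep the snapshot cheap
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                    // id + name, because thread names are not guaranteed to be unique
                    blockedNow.add(info.getThreadId() + ":" + info.getThreadName());
                }
            }
            if (stillBlocked == null) {
                stillBlocked = blockedNow;
            } else {
                stillBlocked.retainAll(blockedNow);  // intersection across samples
            }
            if (i < samples - 1) {
                Thread.sleep(intervalMs);
            }
        }
        return stillBlocked == null ? new HashSet<>() : stillBlocked;
    }
}
```

A call such as persistentlyBlocked(5, 2_000) returns the threads that stayed blocked for roughly eight seconds. A non-empty result over a long window is the pattern worth investigating; threads that show up in only one sample are usually just brief contention.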

So there is no way to tell from this single snapshot whether 500 blocked threads are good or bad, but it is certainly concerning. Threads are not free either: at roughly 2 MB each (the thread stack, sized by -Xss, plus native bookkeeping), 500 threads take about 1 GB of memory.

Most likely some threads hold critical sections for a long time, and that causes your problem and the unresponsiveness of your application. The situation can also be more complex, for example when threads read from queues using blocking methods.
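To check whether a critical section is the culprit, ThreadInfo reports, for each BLOCKED thread, the thread that owns the monitor it is waiting for, and ThreadMXBean.findDeadlockedThreads() detects outright deadlocks. A minimal in-process sketch, assuming you can run code in the affected JVM; LockOwnerReport and its methods are hypothetical names:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: for every BLOCKED thread, reports which thread
// currently owns the monitor it is waiting for.
class LockOwnerReport {
    static Map<String, String> blockedOnWhom() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, String> ownerByBlockedThread = new HashMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                // may be null if the owner released the monitor in the meantime
                ownerByBlockedThread.put(info.getThreadName(), info.getLockOwnerName());
            }
        }
        return ownerByBlockedThread;
    }

    // Ids of threads deadlocked on monitors or java.util.concurrent locks; empty if none.
    static long[] deadlockedThreadIds() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        blockedOnWhom().forEach((thread, owner) ->
                System.out.println(thread + " is blocked on a monitor held by " + owner));
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
    }
}
```

If many entries point to the same owner, look at that owner's stack trace: it is the thread holding the critical section the others are queuing on.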

Possible course of action:

  • Take several dumps and compare them: what has changed, and which threads are still blocked?
  • Check whether you can pinpoint the calls in the blocked threads' stack traces: do they contain your own package prefix, or only Java and Hazelcast packages?
  • Use monitoring tools (Flight Recorder, jvisualvm) to track the growth of the thread count and find out when the threads that end up blocked are created and what the application is doing at that moment.
  • Review your codebase for potential misuse of blocking calls and synchronized methods or blocks.
  • Check the thread pools' maximum sizes, their work-queue implementations, and the strategies used when the limit is reached (to understand these, look at the implementations of RejectedExecutionHandler).
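For the last point, here is a sketch of a bounded pool: with a fixed number of threads and a bounded ArrayBlockingQueue, tasks that do not fit are handed to a RejectedExecutionHandler instead of piling up. BoundedPoolSketch and CountingRejectionHandler are hypothetical names; the JDK also ships ready-made policies in ThreadPoolExecutor (AbortPolicy, the default, plus CallerRunsPolicy, DiscardPolicy and DiscardOldestPolicy).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedPoolSketch {
    // Hypothetical handler: counts rejected tasks. Real code might log,
    // shed load, or push back on the caller here.
    static final class CountingRejectionHandler implements RejectedExecutionHandler {
        final AtomicInteger rejected = new AtomicInteger();

        @Override
        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
            rejected.incrementAndGet();
        }
    }

    // Submits `tasks` tasks that block until released to a pool of `threads`
    // workers with a queue of `queueCapacity`; returns how many were rejected.
    static int rejectedCount(int threads, int queueCapacity, int tasks) throws InterruptedException {
        CountingRejectionHandler handler = new CountingRejectionHandler();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), handler);
        CountDownLatch gate = new CountDownLatch(1);  // keeps the workers busy
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    gate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        gate.countDown();  // release the running tasks; queued ones then finish quickly
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return handler.rejected.get();
    }
}
```

With rejectedCount(2, 3, 10), two tasks occupy the two workers, three wait in the queue, and the remaining five are rejected, so the method returns 5. By contrast, an unbounded LinkedBlockingQueue never rejects anything, and the pool will not grow past its core size.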
kkmazur

These thread dumps cannot reasonably be compared, because they were obtained in different ways. The first one was taken in "forced" mode (-F) with the Serviceability Agent; the second one is a "normal" dump taken through the Attach API. The difference is explained here.

The meaning of the output differs as well. A "normal" dump shows the state of the java.lang.Thread object, while a "forced" dump shows the state of the corresponding VM thread. From the JVM's point of view, a thread can be in one of the IN_NATIVE, IN_VM or IN_JAVA states, in a transition state, or in the BLOCKED state. Here BLOCKED means basically any non-runnable state, including when the thread is sleeping, waiting or parked.

In your first dump, the BLOCKED threads are inside the Unsafe.park method: they seem to be just idle and are unlikely to cause problems.
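A quick way to convince yourself that such threads are idle: a ThreadPoolExecutor worker with nothing to do sits in LinkedBlockingQueue.take(), the same frames that appear in the production dump, and at the Java level it reports WAITING, not BLOCKED. A sketch; IdleWorkerDemo and the thread name "idle-worker" are hypothetical:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo: an idle pool worker parks in LinkedBlockingQueue.take()
// and Thread.getState() reports WAITING for it.
class IdleWorkerDemo {
    static Thread.State idleWorkerState() throws InterruptedException {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, "idle-worker");
            t.setDaemon(true);
            worker.set(t);  // remember the worker so we can inspect it
            return t;
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(), factory);
        try {
            pool.execute(() -> { });  // starts the worker; afterwards it has nothing to do
            Thread t = worker.get();  // set synchronously while the worker was created
            long deadline = System.nanoTime() + 2_000_000_000L;
            while (t.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
            return t.getState();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

idleWorkerState() should return WAITING. The VM-level view used by a forced dump folds this kind of parking into its BLOCKED state, as described above, which is why the same idle workers appear as BLOCKED there.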

WAITING and TIMED_WAITING are values of the Java-level Thread.State. You can see them only in a "normal" dump, that is, one taken without the -F option.

When you can't take a "normal" dump, this usually means that the target JVM is busy with a long-running safepoint operation (for example, a Full GC), or that the process does not get CPU time (for instance, it runs out of memory and starts swapping). An OS-level profiler such as perf can be useful in such cases.
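Alongside perf, GC pressure can be checked with the standard GarbageCollectorMXBean. Because this code runs inside the JVM, it cannot report anything while the JVM is stuck in a safepoint; the assumption is that you log it periodically (or read the same MXBean remotely over JMX) to see the trend leading up to a hang. GcActivity and gcShare are hypothetical names, and the ratio is only approximate: collection times are accumulated elapsed times, and how concurrent collectors are accounted for depends on the JVM.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums the accumulated collection time of all collectors
// and estimates what share of a time window was spent in GC.
class GcActivity {
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // -1 if this collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Approximate share of `windowMs` spent in GC; values near 1.0 suggest the
    // JVM is mostly collecting garbage instead of running application code.
    static double gcShare(long windowMs) throws InterruptedException {
        long gcBefore = totalGcTimeMs();
        long start = System.nanoTime();
        Thread.sleep(windowMs);
        long elapsedMs = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        return (double) (totalGcTimeMs() - gcBefore) / elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.printf("GC share over 1 s: %.2f%n", gcShare(1_000));
    }
}
```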

apangin