
My Java app's log timestamps suddenly jumped from 22:20 to 00:33 of the next day. Before 22:20 and after 00:33 the app ran correctly. It looks like the app was frozen between 22:20 and 00:33.

There's a thread pool in my app. I use a DelayQueue to handle business tasks (Runnables) that time out, so I submit a task to the thread pool that polls the DelayQueue for timed-out business tasks and handles them:

delayFuture = pool.submit(() -> {
    logger.debug("delayed task execution thread: {}", Thread.currentThread().getName());
    while (true) {
        try {
            // returns a task whose delay has expired, or null after waiting 5 seconds
            AbstractTask task = delayQueue.poll(5, TimeUnit.SECONDS);
            if (task != null && !task.isDone()) {
                // mark the task as timed out and interrupt whichever thread is running it
                task.setOvertime(true);
                task.getExecutingThread().interrupt();

                // run the timeout handling on another pool thread
                pool.submit(() -> task.fireTimeout());
            }
        } catch (Exception e) {
            logger.error("delay task error", e);
        }
    }
});
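
For context, anything put into a DelayQueue must implement java.util.concurrent.Delayed. A minimal sketch of how AbstractTask could satisfy that contract (the fields and method bodies here are illustrative, not my actual code):

import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Sketch only: the fields and bodies below are assumptions for illustration.
public abstract class AbstractTask implements Delayed {
    private final long deadlineNanos;        // absolute time at which this task is overdue
    private volatile boolean done;
    private volatile boolean overtime;
    private volatile Thread executingThread; // recorded by the worker running the task

    protected AbstractTask(long timeout, TimeUnit unit) {
        this.deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // remaining delay; once it reaches zero, delayQueue.poll() may return this task
        return unit.convert(deadlineNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }

    public boolean isDone()                   { return done; }
    public void markDone()                    { done = true; }
    public void setOvertime(boolean overtime) { this.overtime = overtime; }
    public boolean isOvertime()               { return overtime; }
    public void setExecutingThread(Thread t)  { executingThread = t; }
    public Thread getExecutingThread()        { return executingThread; }
    public abstract void fireTimeout();
}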

All business tasks do similar work with a fixed delay. When a business task is executed, it completes several steps one by one in a single thread of the pool. But the log showed that some tasks "blocked" in the very same step, while executing

LinkedBlockingQueue.poll(timeout, unit);
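
Each step essentially waits for a response with a bounded timeout; a hypothetical sketch of one such step (the queue and all names are illustrative):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of one blocking step, assuming each task waits for a
// device response on its own queue. All names here are illustrative.
public class StepSketch {
    private final LinkedBlockingQueue<String> responses = new LinkedBlockingQueue<>();

    String awaitResponse() throws InterruptedException {
        // blocks for at most 30 seconds and returns null on timeout, so the
        // next log statement should normally appear within 30 seconds either way
        return responses.poll(30, TimeUnit.SECONDS);
    }
}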

It's really weird that the log time jumped from 22:20 to 00:33 and then everything went back to normal:

2019-02-16 22:20:20,627 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:31) - device some_device sent control cmd response
2019-02-16 22:20:20,647 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:36) - user some_user received device some_device control cmd response

2019-02-16 22:20:25,809 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:38) - device some_device disconnected, cancel device heartbeat: true
// the next line was logged only after LinkedBlockingQueue.poll(timeout, unit) returned
2019-02-17 00:33:18,751 INFO  (biz_thread_64:com.some.company.task.AbstractTask.interruptProcess:46) - user some_user received device some_device offline message

2019-02-17 00:33:18,747 INFO  (biz_thread_56:com.some.company.task.TaskManager.lambda$null$0:54) - fire task timeout operation: user get device offline notification, devMac=some_device, userToken=some_user
2019-02-17 00:33:18,951 INFO  (biz_thread_56:com.some.company.task.AbstractTask.resetState:174) - device some_device disconnected, cancel device heartbeat: false
2019-02-17 00:33:18,951 INFO  (biz_thread_56:com.some.company.task.AbstractTask.resetState:183) - user some_user disconnected, cancel user heartbeat: true

2019-02-17 00:33:18,962 INFO  (biz_thread_64:com.some.company.task.TaskManager.exec:170) - task is overtime: task=user get device offline notification, devMac=some_device, userToken=some_user

There is a jump between the third and fourth log lines above, but normally the time interval between those two steps should be less than 4 minutes. It looks like the app was frozen between 22:20 and 00:33.

Some other useful info:

  1. The thread pool is fixed-sized;

  2. The app runs in a VM, but another Java app on the same VM has logs between 22:20 and 00:33, so the VM did not sleep during that time range;

  3. There is another fixed-delay task that prints the status of each business task, but it also printed nothing between 22:20 and 00:33;

  4. There is an HTTP server based on Vert.x that provides RESTful services. The RESTful services were unresponsive while the problem occurred, i.e. the browser hung when issuing an HTTP GET request to the server, but telnet to the server's port still worked.

Any suggestions will be appreciated.

Update (2019-02-26 18:13)

It happened again! My app froze at 2019-02-26 17:30:17,576 and recovered at 2019-02-26 19:40:29,655, i.e. for about 2 hours, 10 minutes and 12 seconds. The last occurrence of this problem lasted 2 hours, 12 minutes and 53 seconds.

When the problem occurred, both the jstack and jmap commands failed to execute:

[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jstack -F 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
    at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
    at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
    at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
    at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
    at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
    at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)

[cloud-user@extdev-platform-ug bin]$ ~/jdk1.8.0_192/bin/jmap -heap 10914
Attaching to process ID 10914, please wait...
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:163)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach(LinuxDebuggerLocal.java:278)
    at sun.jvm.hotspot.HotSpotAgent.attachDebugger(HotSpotAgent.java:671)
    at sun.jvm.hotspot.HotSpotAgent.setupDebuggerLinux(HotSpotAgent.java:611)
    at sun.jvm.hotspot.HotSpotAgent.setupDebugger(HotSpotAgent.java:337)
    at sun.jvm.hotspot.HotSpotAgent.go(HotSpotAgent.java:304)
    at sun.jvm.hotspot.HotSpotAgent.attach(HotSpotAgent.java:140)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:185)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jmap.JMap.runTool(JMap.java:201)
    at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process: ptrace(PTRACE_ATTACH, ..) failed for 10914: Operation not permitted
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.attach0(Native Method)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$100(LinuxDebuggerLocal.java:62)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1AttachTask.doit(LinuxDebuggerLocal.java:269)
    at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:138)

The last GC log entry before the problem:

[GC pause (G1 Evacuation Pause) (young), 0.0388874 secs]
   [Parallel Time: 26.9 ms, GC Workers: 2]
      [GC Worker Start (ms): Min: 22597871.6, Avg: 22597871.6, Max: 22597871.7, Diff: 0.0]
      [Ext Root Scanning (ms): Min: 3.8, Avg: 3.9, Max: 4.0, Diff: 0.2, Sum: 7.9]
      [Update RS (ms): Min: 9.8, Avg: 9.8, Max: 9.8, Diff: 0.1, Sum: 19.6]
         [Processed Buffers: Min: 245, Avg: 266.0, Max: 287, Diff: 42, Sum: 532]
      [Scan RS (ms): Min: 2.3, Avg: 2.3, Max: 2.4, Diff: 0.2, Sum: 4.7]
      [Code Root Scanning (ms): Min: 0.2, Avg: 0.7, Max: 1.2, Diff: 1.0, Sum: 1.4]
      [Object Copy (ms): Min: 9.4, Avg: 9.9, Max: 10.5, Diff: 1.1, Sum: 19.8]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 26.7, Avg: 26.7, Max: 26.7, Diff: 0.0, Sum: 53.5]
      [GC Worker End (ms): Min: 22597898.4, Avg: 22597898.4, Max: 22597898.4, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 11.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 8.4 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.7 ms]
   [Eden: 604.0M(604.0M)->0.0B(604.0M) Survivors: 10.0M->10.0M Heap: 707.4M(1024.0M)->103.4M(1024.0M)]
 [Times: user=0.06 sys=0.00, real=0.04 secs] 

My app runs with the options

-Xms1g -Xmx1g -server -XX:+PrintGCDetails -verbose:gc -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+HeapDumpOnOutOfMemoryError

The top command gives

top -H -p 10914

Tasks: 545 total,   0 running,   0 sleeping, 545 stopped,   0 zombie
Cpu0  :  0.7%us,  1.0%sy,  0.0%ni, 97.3%id,  0.0%wa,  0.0%hi,  0.0%si,  1.0%st
Cpu1  :  4.0%us,  0.7%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5991196k total,  5858276k used,   132920k free,   225448k buffers
Swap:  6291448k total,   132044k used,  6159404k free,  3820256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                 
10914 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.10 java                                                                     
10915 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:03.17 java                                                                     
10916 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.68 java                                                                     
10917 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.71 java                                                                     
10918 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:20.87 java                                                                     
10919 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.07 java                                                                     
10920 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.13 java                                                                     
10921 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.05 java                                                                     
10922 cloud-us  20   0 4151m 1.0g  15m T  0.0 17.3   0:00.16 java

The netstat command showed 16 TCP connections in CLOSE_WAIT or ESTABLISHED status; based on the log, 16 is also the number of chained tasks that did not finish normally.

My app periodically logs the status of the threads in the business thread pool. The last output before the problem:

    "biz_thread_0":"TIMED_WAITING",
    "biz_thread_1":"TIMED_WAITING",
    "biz_thread_2":"TIMED_WAITING",
    "biz_thread_3":"TIMED_WAITING",
    "biz_thread_4":"TIMED_WAITING",
    "biz_thread_5":"TIMED_WAITING",
    "biz_thread_6":"WAITING",
    "biz_thread_7":"TIMED_WAITING",
    "biz_thread_8":"TIMED_WAITING",
    "biz_thread_9":"TIMED_WAITING",
    "biz_thread_10":"WAITING",
    "biz_thread_11":"TIMED_WAITING",
    "biz_thread_12":"WAITING",
    "biz_thread_13":"TIMED_WAITING",
    "biz_thread_14":"TIMED_WAITING",
    "biz_thread_15":"TIMED_WAITING",
    "biz_thread_16":"TIMED_WAITING",
    "biz_thread_17":"TIMED_WAITING",
    "biz_thread_18":"TIMED_WAITING",
    "biz_thread_19":"TIMED_WAITING",
    "biz_thread_20":"TIMED_WAITING",
    "biz_thread_21":"TIMED_WAITING",
    "biz_thread_22":"TIMED_WAITING",
    "biz_thread_23":"TIMED_WAITING",
    "biz_thread_24":"TIMED_WAITING",
    "biz_thread_25":"TIMED_WAITING",
    "biz_thread_26":"WAITING",
    "biz_thread_27":"WAITING",
    "biz_thread_28":"WAITING",
    "biz_thread_29":"WAITING",
    "biz_thread_30":"WAITING",
    "biz_thread_31":"TIMED_WAITING",
    "biz_thread_32":"TIMED_WAITING",
    "biz_thread_33":"TIMED_WAITING",
    "biz_thread_34":"RUNNABLE",
    "biz_thread_35":"TIMED_WAITING",
    "biz_thread_36":"TIMED_WAITING",
    "biz_thread_37":"WAITING",
    "biz_thread_38":"TIMED_WAITING",
    "biz_thread_39":"TIMED_WAITING",
    "biz_thread_40":"TIMED_WAITING",
    "biz_thread_41":"TIMED_WAITING",
    "biz_thread_42":"TIMED_WAITING",
    "biz_thread_43":"TIMED_WAITING",
    "biz_thread_44":"WAITING",
    "biz_thread_45":"WAITING",
    "biz_thread_46":"TIMED_WAITING",
    "biz_thread_47":"TIMED_WAITING",
    "biz_thread_48":"TIMED_WAITING",
    "biz_thread_49":"TIMED_WAITING",
    "biz_thread_50":"TIMED_WAITING",
    "biz_thread_51":"WAITING",
    "biz_thread_52":"TIMED_WAITING",
    "biz_thread_53":"TIMED_WAITING",
    "biz_thread_54":"TIMED_WAITING",
    "biz_thread_55":"TIMED_WAITING",
    "biz_thread_56":"WAITING",
    "biz_thread_57":"TIMED_WAITING",
    "biz_thread_58":"TIMED_WAITING",
    "biz_thread_59":"TIMED_WAITING",
    "biz_thread_60":"TIMED_WAITING",
    "biz_thread_61":"TIMED_WAITING",
    "biz_thread_62":"TIMED_WAITING",
    "biz_thread_63":"TIMED_WAITING",
    "biz_thread_64":"TIMED_WAITING",
    "biz_thread_65":"WAITING",
    "biz_thread_66":"TIMED_WAITING",
    "biz_thread_67":"WAITING",
    "biz_thread_68":"WAITING",
    "biz_thread_69":"TIMED_WAITING",
    "biz_thread_70":"TIMED_WAITING",
    "biz_thread_71":"TIMED_WAITING",
    "biz_thread_72":"WAITING",
    "biz_thread_73":"TIMED_WAITING",
    "biz_thread_74":"TIMED_WAITING",
    "biz_thread_75":"TIMED_WAITING",
    "biz_thread_76":"WAITING",
    "biz_thread_77":"TIMED_WAITING",
    "biz_thread_78":"TIMED_WAITING",
    "biz_thread_79":"TIMED_WAITING",
    "biz_thread_80":"TIMED_WAITING",
    "biz_thread_81":"TIMED_WAITING",
    "biz_thread_82":"WAITING",
    "biz_thread_83":"TIMED_WAITING",
    "biz_thread_84":"TIMED_WAITING",
    "biz_thread_85":"TIMED_WAITING",
    "biz_thread_86":"TIMED_WAITING",
    "biz_thread_87":"TIMED_WAITING",
    "biz_thread_88":"TIMED_WAITING",
    "biz_thread_89":"WAITING",
    "biz_thread_90":"TIMED_WAITING",
    "biz_thread_91":"TIMED_WAITING",
    "biz_thread_92":"TIMED_WAITING",
    "biz_thread_93":"TIMED_WAITING",
    "biz_thread_94":"WAITING",
    "biz_thread_95":"TIMED_WAITING",
    "biz_thread_96":"TIMED_WAITING",
    "biz_thread_97":"TIMED_WAITING",
    "biz_thread_98":"TIMED_WAITING",
    "biz_thread_99":"TIMED_WAITING",
    "biz_thread_100":"TIMED_WAITING",
    "biz_thread_101":"TIMED_WAITING",
    "biz_thread_102":"TIMED_WAITING",
    "biz_thread_103":"WAITING",
    "biz_thread_114":"WAITING",
    "biz_thread_113":"TIMED_WAITING",
    "biz_thread_112":"WAITING",
    "biz_thread_111":"TIMED_WAITING",
    "biz_thread_110":"TIMED_WAITING",
    "biz_thread_109":"TIMED_WAITING",
    "biz_thread_108":"TIMED_WAITING",
    "biz_thread_107":"TIMED_WAITING",
    "biz_thread_106":"TIMED_WAITING",
    "biz_thread_105":"WAITING",
    "biz_thread_104":"TIMED_WAITING",
    "biz_thread_115":"TIMED_WAITING",
    "biz_thread_116":"TIMED_WAITING",
    "biz_thread_122":"TIMED_WAITING",
    "biz_thread_121":"TIMED_WAITING",
    "biz_thread_120":"TIMED_WAITING",
    "biz_thread_119":"TIMED_WAITING",
    "biz_thread_118":"TIMED_WAITING",
    "biz_thread_117":"TIMED_WAITING",
    "biz_thread_126":"TIMED_WAITING",
    "biz_thread_125":"TIMED_WAITING",
    "biz_thread_124":"WAITING",
    "biz_thread_123":"WAITING",
    "biz_thread_127":"WAITING"
  • Hi Allen, did your computer went to sleep between those times? Could you provide some more java or system event logs from within that timeframe? – Dilyano Senders Feb 18 '19 at 10:29
  • @DilyanoSenders The computer did not go to sleep as there is another java app which has logs between 22:20 and 00:33. Sorry I can not provide more system event logs. More code can be provided if needed. – Allen Xu Feb 18 '19 at 10:45
  • If you could provide a little bit more that would be nice, because if I look at this code and logging, I would say the JVM went into a hibernate state or the other Java process took the JVM for usage – Dilyano Senders Feb 18 '19 at 10:48
  • @DilyanoSenders The other java process runs with JDK 1.7 while my app runs with JDK 1.8. Maybe the problem is the usage of the thread pool. Each business task consists of several steps, and every step times out if it takes too long. I updated the code about timeout task handling. – Allen Xu Feb 18 '19 at 11:00
  • But they are using the same JVM right? I mean you can use different JDK's (development kits) to develop, but the run can still happen on only 1 JVM. – Dilyano Senders Feb 18 '19 at 11:02
  • @DilyanoSenders No they use different JVM. I setup a new JDK in the computer to run my app. The other app run before I setup that JDK. – Allen Xu Feb 18 '19 at 11:07
  • Sorry maybe a bit of a hard question, but you know the difference between a JDK and a JVM right? – Dilyano Senders Feb 18 '19 at 11:15
  • @DilyanoSenders Yeah. My app and the other app are two different processes. So they should not run in one same JVM. – Allen Xu Feb 19 '19 at 01:14
  • Ok, could you then please provide some more code/logging? – Dilyano Senders Feb 19 '19 at 17:18
  • @DilyanoSenders Thank you for your help. I've created a github repo to demonstrate my app https://github.com/xiuyouxu/frozen-java-app. The real tasks in my app make socket connections and write or read messages with netty. The repo uses Thread.sleep to simulate the real tasks. – Allen Xu Feb 20 '19 at 09:38
  • How do you know that the threads were blocked on poll, do you have stack traces? And are any of your methods synchronized? If isDone or setOvertime is synchronized, for example, the thread could just as well hang trying to call one of them. – ewramner Feb 21 '19 at 16:12
  • Looking at the code in your git repo you synchronize on task both in the if statement above (before calling setOvertime) and in exec. That means that you won't be able to proceed and call setOvertime until exec has returned. That could explain the lack of logs, presuming that one of your tasks has indeed stopped responding. – ewramner Feb 21 '19 at 16:18
  • @ewramner I think the threads were blocked on poll because the log showed that, but I do not have stack traces. The synchronized block in method exec should not block anything. The tasks are chained and run synchronously in method exec calling task.process(). By the way, there was no synchronized when the problem occurred; I added it to prevent a task from finishing normally and timing out at the same time. A task should only time out or finish normally. – Allen Xu Feb 22 '19 at 01:57
  • @ewramner The real tasks are making http request, connecting to a tcp server and sending some data and receiving some data, all with timeout set. I will check all the tasks and the log again to confirm your suggestion. – Allen Xu Feb 22 '19 at 02:20
  • You couldn't attach a debugger nor print the stacktrace. That means your JVM couldn't get to a safe point to run these admin operations. What type of ThreadPool (fixed, cached, etc..) are you using? – John Vint Feb 26 '19 at 15:23
  • @JohnVint Fixed. – Allen Xu Feb 27 '19 at 00:44

1 Answer


Finally!

I do not know which Java option or other factor helped, but this time there is a full thread dump in the file that nohup redirected stdout to (there was no thread dump the first time the problem occurred).

2019-02-26 19:40:29
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.192-b12 mixed mode):

"biz_thread_87" #104 prio=5 os_prio=0 tid=0x00007f93e76fd800 nid=0x2b24 waiting for monitor entry [0x00007f938bfbe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:217)
    - waiting to lock <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:207)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
    at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1999)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1868)
    at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:194)
    at com.some.company.task.impl.StartTask.interruptProcess(StartTask.java:17)
    at com.some.company.common.task.AbstractTask.process(AbstractTask.java:88)
    at com.some.company.common.task.TaskManager.exec(TaskManager.java:174)
    at com.some.company.common.task.Context.lambda$resubmit$0(Context.java:426)
    at com.some.company.common.task.Context$$Lambda$97/519533636.run(Unknown Source)
    at com.some.company.common.task.TaskManager.lambda$delayExec$2(TaskManager.java:121)
    at com.some.company.common.task.TaskManager$$Lambda$37/873634936.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"nioEventLoopGroup-173-1" #315 prio=10 os_prio=0 tid=0x00007f939420e000 nid=0x2ced runnable [0x00007f937b3e8000]
   java.lang.Thread.State: RUNNABLE
    at java.io.RandomAccessFile.writeBytes(Native Method)
    at java.io.RandomAccessFile.write(RandomAccessFile.java:525)
    at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.writeToDestination(RollingRandomAccessFileManager.java:109)
    - eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:288)
    - eliminated <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager.flush(RollingRandomAccessFileManager.java:128)
    - locked <0x00000000c06725d8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:179)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
    at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:218)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:439)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:434)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
    at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2005)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1876)
    at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:184)
    at com.some.company.UserWifiClient.sendHeartbeat(UserWifiEppClient.java:39)
    at com.some.company.AbstractNettyClient.lambda$scheduleHeartbeat$0(AbstractNettyClient.java:146)
    at com.some.company.AbstractNettyClient$$Lambda$73/2134853806.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:133)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

There are 101 BLOCKED threads, and 94 threads are "waiting to lock <0x00000000c06725d8>".

So I think it must be a misuse of the log4j Logger. In AbstractTask the logger was created like this:

protected final static Logger logger = LoggerFactory.getLogger(AbstractTask.class);

All AbstractTask implementations just use this logger, but a logger created this way introduces contention when multiple threads try to use it. Now I create the logger this way in AbstractTask:

protected Logger logger = LoggerFactory.getLogger(getClass());

This way each concrete task class gets its own logger object and should no longer face the contention problem.
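
To illustrate why: getClass() is resolved at runtime, so every concrete subclass gets a logger under its own category (the class names below are illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the change: getClass() resolves at runtime, so each concrete
// subclass logs under its own category. Class names are illustrative.
abstract class AbstractTask {
    // one logger per concrete class instead of one logger shared by every subclass
    protected final Logger logger = LoggerFactory.getLogger(getClass());
}

class StartTask extends AbstractTask {
    void interruptProcess() {
        logger.info("device sent control cmd response"); // category: StartTask
    }
}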

See this: Should logger be private static or not

What a big lesson!

By the way, does the JVM ever dump threads to stdout automatically? I found the above "Full thread dump ..." in the file that nohup redirected stdout to.

Was this kind of thread-blocking problem a deadlock? The app recovered about 2 hours later; if it was a deadlock, how could it recover?
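
For what it's worth, a genuine monitor deadlock never resolves by itself, so recovering after 2 hours points to long blocking rather than a lock cycle. The JVM can be asked directly via ThreadMXBean; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: ask the JVM whether a monitor deadlock cycle exists
// among the current threads.
public class DeadlockProbe {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no cycle exists
        if (ids == null) {
            System.out.println("no deadlock cycle detected");
        } else {
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.println(info);
            }
        }
    }
}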

Update (2019-03-11)

Actually, after I modified the way the log4j logger is used, I accidentally ran into this problem again, even after a week of smooth running. I realized that maybe it was caused by jmap: last Friday I ran the command below for several hours

while [ true ]; do jmap -heap 340; echo "=================================================="; sleep 30; done

Then at about 17:00 the app froze again! So if your app is running correctly, run jmap or jstack against its pid as rarely as possible. These commands suspend the JVM while they work, and when they are executed too frequently, the JVM can apparently end up frozen.
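
A less intrusive alternative is to capture stacks from inside the app rather than attaching externally; a sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Sketch: capture thread stacks from inside the JVM instead of attaching
// externally with jstack/jmap.
public class InProcessDump {
    public static void main(String[] args) {
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() prints a truncated stack
        }
    }
}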

  • glad you fixed it. just for the info, are you using log4j1? – aran Feb 27 '19 at 09:04
  • No, I use log4j 2.8.2 with slf4j 1.7.25 and org.apache.logging.log4j:log4j-slf4j-impl:2.8.2. – Allen Xu Feb 27 '19 at 09:09
  • A `kill -3` command will send the stacktrace to the stdout. It's likely you ran that on the pid for the process. The `RollingFileManager` `checkRollover` method will synchronize on `this`. The line `waiting for monitor entry [0x00007f938bfbe000]` is just saying "This thread is waiting to enter the synchronized block for object reference 0x00007f938bfbe000". There is one runnable thread owning the lock for object `0x00007f938bfbe000`. I would look for that in the thread-dump. I would bet that one thread is spending time rolling over the file and it's likely taking a while. – John Vint Feb 27 '19 at 13:32
  • @JohnVint I had not sent `kill -3` to the process. Actually the line `- waiting to lock <0x00000000c06725d8>` showed the object that thread trying to lock. Check this: https://stackoverflow.com/questions/37304897/how-to-find-out-which-thread-holds-the-monitor – Allen Xu Feb 28 '19 at 02:11