5

I have been seeing extremely long application pauses in the GC logs that are seemingly not related to garbage collection. The application is on a 16GB heap and allocated memory throughput is quite large. I understand that the GC logs will show pauses that aren't necessarily related to GC and that seems to be the case here.

Also, during these pauses the CPU hits 100%, there is no disk or memory I/O and there is no logging activity (due to all the application threads being stopped).

Any ideas on what may be causing these long pauses, or any suggestions on how I can track the root cause down?

Here is the relevant snippet from the logs (notice the 151 second pause):

2015-04-21T06:30:04.834-0500: 217179.483: [GC (Allocation Failure)2015-04-21T06:30:04.849-0500: 217179.483: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   19876896 bytes,   19876896 total
- age   2:   15903904 bytes,   35780800 total
- age   3:     617120 bytes,   36397920 total
- age   4:      34096 bytes,   36432016 total
- age   5:     399896 bytes,   36831912 total
- age   6:      71112 bytes,   36903024 total
: 706301K->45145K(766784K), 0.0681632 secs] 19029029K->18368371K(25080640K), 0.0684183 secs] [Times: user=0.44 sys=0.00, real=0.08 secs] 
2015-04-21T06:30:05.816-0500: 217180.453: Total time for which application threads were stopped: 1.8507910 seconds
2015-04-21T06:30:09.171-0500: 217183.812: [GC (Allocation Failure)2015-04-21T06:30:09.171-0500: 217183.812: [ParNew
Desired survivor size 43614208 bytes, new threshold 6 (max 6)
- age   1:   14812976 bytes,   14812976 total
- age   2:    3411968 bytes,   18224944 total
- age   3:   14952032 bytes,   33176976 total
- age   4:     430048 bytes,   33607024 total
- age   5:      32864 bytes,   33639888 total
- age   6:     374880 bytes,   34014768 total
: 726745K->43352K(766784K), 0.0673428 secs] 19049971K->18366645K(25080640K), 0.0675792 secs] [Times: user=0.48 sys=0.00, real=0.06 secs] 
2015-04-21T06:30:09.249-0500: 217183.886: Total time for which application threads were stopped: 0.0763406 seconds
2015-04-21T06:32:43.960-0500: 217338.603: Total time for which application threads were stopped: 151.1866188 seconds
2015-04-21T06:32:43.975-0500: 217338.607: [GC (GCLocker Initiated GC)2015-04-21T06:32:43.975-0500: 217338.607: [ParNew
Desired survivor size 43614208 bytes, new threshold 4 (max 6)
- age   1:   18173880 bytes,   18173880 total
- age   2:    9659800 bytes,   27833680 total
- age   3:    3400176 bytes,   31233856 total
- age   4:   14928944 bytes,   46162800 total
- age   5:     375056 bytes,   46537856 total
- age   6:      31880 bytes,   46569736 total
: 727682K->56678K(766784K), 0.0759093 secs] 19050975K->18380341K(25080640K), 0.0760795 secs] [Times: user=0.58 sys=0.00, real=0.08 secs] 
2015-04-21T06:32:44.053-0500: 217338.684: Total time for which application threads were stopped: 0.0769955 seconds

Here are my JVM settings at the moment:

-Xmx16g
-XX:MaxPermSize=256m
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=true
-Dcom.sun.management.jmxremote.login.config=JmxFileLoginModule
-Dcom.sun.management.jmxremote.port=8082
-Dcom.sun.management.jmxremote.ssl=false
-Dfile.encoding=UTF8
-Djava.awt.headless=true
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Xloggc:{fileLoc}
-XX:+DisableExplicitGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCCause
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:GCLogFileSize=50M
-XX:NumberOfGCLogFiles=10
-XX:ReservedCodeCacheSize=256M
-XX:-UseGCLogFileRotation
-XX:+PrintSafepointStatistics
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=4000
-XX:+PrintSafepointStatistics

UPDATE:

I forgot to include my safepoint logging here. This is what my safepoint log looks like for the 151 second pause experienced above:

217338.603: [sweeping nmethods, 0.0000036 secs]
             vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
    217187.422: GenCollectForAllocation          [     140         46              1    ]      [     1     0151184     2     0    ]  0

The block time is 0, but it appears right next to the sync time, so the sync time is actually 151184.

For other GC-related pauses the safepoint log appears like this:

107100.562: [sweeping nmethods, 0.0000015 secs]
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
107100.563: GenCollectForAllocation          [     117          0              0    ]      [     0     0     0     0 38238    ]  0

So, the GC pause shows as vmop time and the non-GC pauses show as sync time. I'll have to read more about what this means.

jgoetz
  • 61
  • 3
  • How are you measuring "Total time for which application threads were stopped"? Or is that part of the GC output? – Sbodd Apr 22 '15 at 20:32
  • Just use [*this technique*](http://stackoverflow.com/a/317160/23771). Some thread will be doing something, and it will show you what. If it stops in GC, that's not going to tell you much, but everything GC'd also has to be `new`ed, so if you catch it in `new` (on more than one sample), it's worth trying to avoid doing that. – Mike Dunlavey Apr 22 '15 at 20:33
  • Total time for which application threads were stopped is coming from the option "PrintGCApplicationStoppedTime" – jgoetz Apr 22 '15 at 20:41
  • 1
    probably a non-GC safepoint, use `-XX:+PrintSafepointStatistics –XX:PrintSafepointStatisticsCount=1 ` to print additional information. (those logs go to stdout, not the gc log file, so maybe you were looking at the wrong place – the8472 Apr 22 '15 at 20:44
  • @the8472 - Thanks, I forgot to include my safepoint logging. I added it to the description above. – jgoetz Apr 22 '15 at 21:51

2 Answers2

2

We've experienced a similar problem that appeared to be a JVM bug JDK-8041984.
I'm not sure this is your case as well, but consider updating to JDK 8u45 or 7u80.

Whenever you see this problem again try to take a forced stack dump with the native frames during this long pause:

jstack -F -m <PID>

This will help to identify what the VM is doing even if the process is not responding.

apangin
  • 92,924
  • 10
  • 193
  • 247
  • Thanks for pointing out the JVM bug. I'll see if updating could solve the problem. As for the stack dump, I've found it difficult to get a stack dump because the cpu is at 100% and completely unresponsive. I've taken flight recordings using Java Mission Control and it takes thread dumps every minute during a recording. However, when the pause happens the thread dump isn't taken, so I'm at a bit of a loss of what to do there. – jgoetz Apr 23 '15 at 05:43
  • Right, you won't be able to take thread dumps using JMC since it expects cooperation from JVM. But `jstack` should work. You probably need to adjust the process priority (using `nice`) so that you could run commands even when JVM utilizes 100% CPU. – apangin Apr 23 '15 at 11:51
0

The block time is 0, but it appears right next to the sync time, so the sync time is actually 151184.

You're reading the output wrong, due to tabs the headers are not properly aligned, you need to count columns.

As I understand it it's basically waiting 151 seconds to reach a safepoint because a mutator-thread remains a live despite a safepoint trap being active.

This can happen either when it spends too much time in native code or as @apangin mentions when compiler threads do excessive amounts of work.

I think there were also some cases where the compiler would optimize away a safepoint in a loop backedge when it could prove that the loop finishes in a finite amount of time, which could take several seconds in loops iterating over long counters.

the8472
  • 40,999
  • 5
  • 70
  • 122