
Over the last couple of days we have been seeing JVMs on our servers get into a state where they spend 100% of their CPU time in GC of the OldGen, even though:

A. They don't need to because there is plenty of space left on the heap and

B. They are not reclaiming anything.

I know they are in GC by taking a stack trace and correlating the thread IDs shown in Process Explorer with those in the stack dump. Each GC thread is occupying around 4% CPU.
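In case the technique is useful to others, this is roughly how that correlation works (the pid and thread IDs here are made up for illustration). Take a thread dump with:

    jstack 1234 > threads.txt

In the dump the parallel collector's workers appear as "GC task thread#0 (ParallelGC)" ... nid=0x1a2b. The nid value is the OS thread ID in hex, so a decimal TID of 6699 in Process Explorer corresponds to nid=0x1a2b in the stack dump.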

Servers are running 16 GB heaps (32 GB physical RAM) and have 8 cores. Up-time is usually around 30 days, with reboots required only for MS patching, but currently they are crashing at the 20-day mark.

Here is a graph of GC duration (time scale = 19 days): http://i45.tinypic.com/257qalu.png

Here is a zoom of the tail of that graph: http://i48.tinypic.com/2duiccw.png

As you can see, duration increases dramatically.

Here is a graph of the heap usage after GC: http://i48.tinypic.com/znna4h.png

If it were a typical memory leak, I would expect to see the orange peaks climb higher and higher until they couldn't climb any more, but as this graph shows, there is plenty of heap space left.

I've got heap dumps from each server and nothing stands out as problematic. There are a few Ehcache stores and I can see our application code, i.e. just "normal stuff".

The biggest change we made ~20 days ago was to apply a vendor patch that changed an internal cache from an unbounded HashMap of hard references (an obvious memory leak) to one based on soft references. I am wondering if this is the cause, i.e. whether there is a massive overhead in managing these soft references after a certain point.
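For reference, here is a minimal sketch of what I understand that kind of soft-reference cache to look like (the class and method names are my own illustration, not the vendor's actual code):

    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Values can be cleared by the GC under memory pressure, but every entry
    // adds a SoftReference object that the collector has to discover and
    // process on each GC cycle.
    public class SoftRefCache<K, V> {
        private final Map<K, SoftReference<V>> map =
                new ConcurrentHashMap<K, SoftReference<V>>();

        public void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        public V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get();
            if (ref != null && value == null) {
                map.remove(key); // referent was collected; drop the stale entry
            }
            return value;
        }
    }

If the map itself is never pruned, the SoftReference wrappers and stale entries accumulate even after their referents have been collected, which is the kind of reference-processing overhead I am wondering about.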

Does anyone have any ideas as to where to look next, or can someone confirm my soft reference theory?

Here is my jvm.args:

java.args=-server -Xms16000m -Xmx16000m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=350m -Xloggc:e:/gcLogs/eRGCLogs.txt -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:+UseParallelOldGC -Dnet.sf.ehcache.sizeof.filter=D:/jo3/java_ehCacheOpenSource/sizeOfExclusions.config -Xbatch -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib -Dcoldfusion.classPath={application.home}/../lib/updates,{application.home}/../lib,{application.home}/../gateway/lib/,{application.home}/../wwwroot/WEB-INF/flex/jars,{application.home}/../wwwroot/WEB-INF/cfform/jars,d:/jo3/java,d:/JO3/java_ehCacheOpenSource/,D:/jo3/java_ehCacheMonitorProbe

We are on ColdFusion, which is essentially a large framework that sits on top of Java.

JVM version: 1.6.0_29

As requested, "normal" GC logs look like:

2013-03-19T22:11:36.670+1100: 1288665.702: [GC [PSYoungGen: 4695800K->471119K(4722112K)] 9301727K->5077046K(15644800K), 0.3584434 secs] [Times: user=5.01 sys=0.00, real=0.36 secs]
2013-03-19T22:14:55.078+1100: 1288864.099: [GC [PSYoungGen: 4722063K->498009K(4783104K)] 9327990K->5103936K(15705792K), 0.3766889 secs] [Times: user=5.37 sys=0.00, real=0.38 secs]
2013-03-19T22:17:46.749+1100: 1289035.760: [GC [PSYoungGen: 4654489K->517299K(4673792K)] 9260416K->5123227K(15596480K), 0.4130828 secs] [Times: user=5.80 sys=0.00, real=0.41 secs]
2013-03-19T22:21:08.762+1100: 1289237.763: [GC [PSYoungGen: 4673779K->522660K(4738880K)] 9279707K->5143831K(15661568K), 0.4005516 secs] [Times: user=5.97 sys=0.00, real=0.40 secs]
2013-03-19T22:23:42.683+1100: 1289391.675: [GC [PSYoungGen: 4582628K->530998K(4590976K)] 9203799K->5186242K(15513664K), 0.4317352 secs] [Times: user=6.24 sys=0.00, real=0.43 secs]
2013-03-19T22:26:11.096+1100: 1289540.080: [GC [PSYoungGen: 4590966K->518331K(4724096K)] 9246210K->5206959K(15646784K), 0.3914401 secs] [Times: user=5.99 sys=0.00, real=0.39 secs]
2013-03-19T22:27:44.076+1100: 1289633.055: [GC [PSYoungGen: 2602730K->447527K(4732864K)] 7291358K->5208743K(15655552K), 0.3725317 secs] [Times: user=5.80 sys=0.00, real=0.37 secs]
2013-03-19T22:27:44.448+1100: 1289633.428: [Full GC (System) [PSYoungGen: 447527K->0K(4732864K)] [ParOldGen: 4761215K->4628296K(10922688K)] 5208743K->4628296K(15655552K) [PSPermGen: 352378K->352287K(352832K)], 4.2955639 secs] [Times: user=57.70 sys=0.06, real=4.30 secs]
2013-03-19T22:30:37.950+1100: 1289806.920: [GC [PSYoungGen: 4004416K->70948K(4690432K)] 8632712K->4699245K(15613120K), 0.1062227 secs] [Times: user=0.76 sys=0.00, real=0.11 secs]
2013-03-19T22:33:27.154+1100: 1289976.115: [GC [PSYoungGen: 4054116K->109175K(4092352K)] 8682413K->4737472K(15015040K), 0.1347919 secs] [Times: user=1.03 sys=0.00, real=0.13 secs]
2013-03-19T22:36:32.120+1100: 1290161.070: [GC [PSYoungGen: 4092343K->147318K(4712320K)] 8720640K->4775615K(15635008K), 0.1593523 secs] [Times: user=1.58 sys=0.00, real=0.16 secs]

When we are in failure mode, the GC logs look like:

2013-03-22T10:03:47.619+1100: 1504185.901: [GC [PSYoungGen: 0K->0K(5452736K)] 4413907K->4413907K(16375424K), 0.0114248 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
2013-03-22T10:03:47.631+1100: 1504185.912: [Full GC [PSYoungGen: 0K->0K(5452736K)] [ParOldGen: 4413907K->4412613K(10922688K)] 4413907K->4412613K(16375424K) [PSPermGen: 358399K->358278K(358400K)], 5.4435442 secs] [Times: user=73.74 sys=0.14, real=5.44 secs]
2013-03-22T10:03:53.145+1100: 1504191.426: [GC [PSYoungGen: 269219K->7734K(5449088K)] 4681833K->4422114K(16371776K), 0.0298728 secs] [Times: user=0.34 sys=0.00, real=0.03 secs]
2013-03-22T10:03:53.175+1100: 1504191.456: [Full GC [PSYoungGen: 7734K->0K(5449088K)] [ParOldGen: 4414379K->4415189K(10922688K)] 4422114K->4415189K(16371776K) [PSPermGen: 358399K->358371K(358400K)], 2.6033684 secs] [Times: user=36.33 sys=0.00, real=2.60 secs]
2013-03-22T10:03:55.788+1100: 1504194.069: [GC [PSYoungGen: 94969K->826K(5451328K)] 4510158K->4416015K(16374016K), 0.0133588 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
2013-03-22T10:03:55.802+1100: 1504194.082: [Full GC [PSYoungGen: 826K->0K(5451328K)] [ParOldGen: 4415189K->4415348K(10922688K)] 4416015K->4415348K(16374016K) [PSPermGen: 358399K->358389K(358400K)], 2.7156884 secs] [Times: user=38.11 sys=0.00, real=2.71 secs]

Dave
  • If your heap is not full, another reason full GCs happen is that PermGen is getting full. Is your PermGen memory use close to its limit? – prunge Mar 22 '13 at 03:50
  • 16Gb for the JVM heap... that's a lot. GC will pause all your threads... so the app literally halts during a GC cycle. Normally we don't go beyond 5Gb... is it time to scale horizontally? – user1697575 Mar 22 '13 at 03:53
  • Regardless of whatever your problem is, you should be using the CMS collector (-XX:+UseConcMarkSweepGC) if latency is important to you. You should also optimally set NewRatio (or equivalent) based on your app's behavior. As @prunge said, I would bet that PermGen is your problem. – mikeslattery Mar 22 '13 at 04:01
  • Cheers guys - we know we are running with a massive heap; full GC time is normally around 4 seconds on average, with an interval of 10 minutes. We will look at reducing it in the future. We do scale out - around 10+ app servers atm. We used to use massive Ehcaches. – Dave Mar 22 '13 at 04:02
  • Will check out the CMS collector - any idea how to see PermGen usage stats from a heap dump? (PermGen is not on the heap) – Dave Mar 22 '13 at 04:03
  • Can't get PermGen usage from a heap dump, but you can use the [jstat](http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html#gcpermcapacity_option) tool. Suggest using jstat with the `-gcpermcapacity` option, which will give details on PermGen usage. Also suggest the `-gccause` option, which may help indicate why full GCs are occurring (the GCC/LGCC cause of the current/last garbage collection will be of interest). – prunge Mar 22 '13 at 04:17
  • If you've got verbose GC logging on (a good idea) then you ought to have PermGen stats already. I use TagTraum GCViewer to review these logs. You can also enable JMX on the VM and then connect with JVisualVM to see heap and perm stats live: -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false – barnyr Mar 22 '13 at 13:09
  • Thanks barnyr, TagTraum GCViewer confirmed the problem - that's a nice tool, much better than HPJMeter. – Dave Mar 22 '13 at 23:08
  • There's a more up-to-date fork here: https://github.com/chewiebug/GCViewer - which I think handles newer GC logging formats better. It's been an invaluable tool for me in debugging a number of GC-related issues over the years. – barnyr Mar 25 '13 at 09:57

1 Answer


As many have mentioned in the comments, running out of space in the PermGen is very likely your cause. This can be caused by excessive interning of Strings throughout the code, which can make the PermGen "explode". Loading a lot of classes (usually via frameworks that do it for you in the background) can also cause this.
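To illustrate the interning point (a made-up sketch, not code from the question): on a Java 6 VM interned Strings are stored in the PermGen, so interning an unbounded stream of unique values will eventually exhaust it.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: on Java 6 each distinct interned String lives in the
    // PermGen, so this loop will eventually die with
    // java.lang.OutOfMemoryError: PermGen space.
    public class InternLeakDemo {
        public static void main(String[] args) {
            List<String> keep = new ArrayList<String>();
            for (long i = 0; ; i++) {
                keep.add(("session-key-" + i).intern()); // a brand-new unique value every time
            }
        }
    }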

Also, as one of the comments mentioned, using the CMS collector (Concurrent Mark Sweep) can reduce your stop-the-world GCs, assuming the problem is with the capacity of the old generation. It can also improve your performance by reducing latency, which is worthwhile regardless of the current problem.
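If you do try CMS on this 1.6 VM, the switch would look roughly like the following, replacing -XX:+UseParallelGC -XX:+UseParallelOldGC in the existing java.args (treat these as a starting point to test, not a tuned recommendation):

    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled

-XX:+CMSClassUnloadingEnabled is worth noting because, unlike the parallel collector's full GCs, CMS does not unload classes or sweep the PermGen during its concurrent cycles unless you enable it.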

Also, if you post fragments of the GC log, that can help point you in the right direction.

About the jstat tool, you can use it in the following manner to get useful information:

jstat -gcutil <pid> <interval> 

I usually use a 1000 ms interval. -gcutil gives you GC utilization (in %), so you can see if any generation is approaching 100%.

You can also use jstat -gc <pid> ... and get the exact capacity of your Old generation.
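For example (the pid and the numbers below are purely illustrative, not taken from this system):

    jstat -gcutil 1234 1000
      S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
      0.00  43.27  61.52  40.42  99.97   8512  712.304   214  905.110  1617.414

The P column is PermGen utilization; a value pinned near 100% together with a rapidly climbing FGC (full GC count) would match the failure mode described in the question.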


EDIT: After reviewing GC Logs

Your GC logs confirm the original premise that your PermGen is filling up. In the time range between 10:03:47 and 10:03:55 I can see that the PermGen constantly reaches its maximum, and each full GC only manages to reclaim around 10-100 KB from it:

2013-03-22T10:03:47.631+1100: 1504185.912: [Full GC [... [PSPermGen: 358399K->358278K(358400K)]...
2013-03-22T10:03:53.175+1100: 1504191.456: [Full GC [... [PSPermGen: 358399K->358371K(358400K)]...
2013-03-22T10:03:55.802+1100: 1504194.082: [Full GC [... [PSPermGen: 358399K->358389K(358400K)]...

If you look at the old and young generations you can see that neither is reaching its maximum; the OldGen is consuming around 4 GB out of 10 GB, so it is not the cause.

From the data you gathered I cannot tell whether the rate at which the PermGen is filling up is consistent with the up-time of the process; it looks as though the PermGen should fill up in a day, not 20 days. So it is hard to say what the definitive solution is, but here are some recommendations:

  1. Examine your code to ensure you are not abusing String's intern() method; if you use it widely for no good reason, this can be your root cause.
  2. Check whether the frameworks you are using generate classes dynamically; this consumes PermGen space as well, although usually only up to a point.
  3. If you can, restart your processes weekly to pre-empt this downtime.
  4. Consider increasing the PermGen space (a concrete example follows this list), but keep monitoring it, since a bigger PermGen might just stretch your 20-day period rather than solve the problem. The PermGen should remain fairly static once the process has been running long enough.
  5. A Google search for "Coldfusion PermGen" produces many hits reporting similar issues; following those may help focus your investigation.
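As an example of recommendation 4, this is a single flag change in the existing java.args; the 512m below is just an illustrative number to buy headroom while you track down the real source of growth, not a computed value:

    -XX:MaxPermSize=512m

You currently run with -XX:MaxPermSize=350m, and the failure-mode logs show PSPermGen capped at 358400K, which is exactly that 350m ceiling. Whatever value you choose, keep watching the PSPermGen numbers in the GC log (or the P column from jstat -gcutil) to confirm usage plateaus rather than just filling up more slowly.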
RonK
  • Thank you for your answer - I have updated the original question with some fragments of the gc logs. – Dave Mar 22 '13 at 22:39