Over the last couple of days, we have been seeing JVMs on our servers get into a state where they spend 100% of CPU time in GC of OldGen, when:
A. They don't need to because there is plenty of space left on the heap and
B. They are not reclaiming anything.
I know they are in GC by taking a thread dump and correlating the thread IDs shown in Process Explorer with those in the dump. Each GC thread is occupying around 4% CPU.
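For anyone repeating that correlation: Process Explorer lists thread IDs in decimal, while the HotSpot thread dump prints the native ID in hex in the nid= field, so the match is just a base conversion (the TID 6348 below is a made-up example, not from our servers):

    // Convert a decimal TID from Process Explorer to the hex form used by
    // the nid= field in a HotSpot thread dump.
    public class TidToNid {
        public static void main(String[] args) {
            int tid = 6348;                                           // example TID only
            System.out.println("nid=0x" + Integer.toHexString(tid));  // prints nid=0x18cc
        }
    }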
The servers run 16 GB heaps (32 GB physical RAM) and have 8 cores. Uptime is usually around 30 days, with reboots required only for MS patching, but currently they are crashing at around the 20-day mark.
Here is a graph of GC duration (time scale = 19 days): http://i45.tinypic.com/257qalu.png
Here is a zoom of the tail of that graph: http://i48.tinypic.com/2duiccw.png
As you can see, duration increases dramatically.
Here is a graph of the heap usage after GC: http://i48.tinypic.com/znna4h.png
If it were a typical memory leak, I would expect to see the orange peaks climb higher and higher until they couldn't climb any further, but as this graph shows, there is plenty of heap space left.
I've got heap dumps for each server and nothing stands out as problematic. There are a few ehCache stores and I can see our application code, i.e., just "normal stuff".
The biggest change we made ~20 days ago was to apply a vendor patch that changed an internal cache from an unbounded HashMap holding hard references (an obvious memory leak) to one based on soft references. I am wondering if this is the cause, i.e., is there somehow a massive overhead in managing these soft references after a certain point?
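To make the theory concrete, here is a minimal sketch of what I assume the patched cache roughly looks like (my own illustration, not the vendor's actual code):

    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical illustration of the patched cache. Each value is held via a
    // SoftReference, so the GC (rather than the cache) decides when entries die.
    // Every collection has to discover and process these references, and with a
    // large, long-lived cache that work falls on the parallel old-gen collector.
    public class SoftRefCache<K, V> {

        private final Map<K, SoftReference<V>> map =
                new ConcurrentHashMap<K, SoftReference<V>>();

        public void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        public V get(K key) {
            SoftReference<V> ref = map.get(key);
            if (ref == null) {
                return null;
            }
            V value = ref.get();     // null if the GC has already cleared the referent
            if (value == null) {
                map.remove(key);     // prune the stale entry; the vendor code may or may not do this
            }
            return value;
        }

        public int size() {
            return map.size();
        }
    }

As far as I understand it, HotSpot only clears soft references under memory pressure: the default policy (-XX:SoftRefLRUPolicyMSPerMB, default 1000) keeps a softly reachable object alive for about one second per megabyte of free heap since its last access, so on a 16 GB heap with lots of free space they can linger for hours, yet every one of them still has to be scanned on each collection.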
Does anyone have any ideas as to where to look next, or can someone confirm my soft reference theory?
Here is my jvm.args:
java.args=-server -Xms16000m -Xmx16000m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=350m -Xloggc:e:/gcLogs/eRGCLogs.txt -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:+UseParallelOldGC -Dnet.sf.ehcache.sizeof.filter=D:/jo3/java_ehCacheOpenSource/sizeOfExclusions.config -Xbatch -Dcoldfusion.rootDir={application.home}/../ -Dcoldfusion.libPath={application.home}/../lib -Dcoldfusion.classPath={application.home}/../lib/updates,{application.home}/../lib,{application.home}/../gateway/lib/,{application.home}/../wwwroot/WEB-INF/flex/jars,{application.home}/../wwwroot/WEB-INF/cfform/jars,d:/jo3/java,d:/JO3/java_ehCacheOpenSource/,D:/jo3/java_ehCacheMonitorProbe
We are on ColdFusion, which is essentially a large framework that sits on top of Java.
JVM version: 1.6.0_29
As requested, "normal" GC logs look like:
    2013-03-19T22:11:36.670+1100: 1288665.702: [GC [PSYoungGen: 4695800K->471119K(4722112K)] 9301727K->5077046K(15644800K), 0.3584434 secs] [Times: user=5.01 sys=0.00, real=0.36 secs]
    2013-03-19T22:14:55.078+1100: 1288864.099: [GC [PSYoungGen: 4722063K->498009K(4783104K)] 9327990K->5103936K(15705792K), 0.3766889 secs] [Times: user=5.37 sys=0.00, real=0.38 secs]
    2013-03-19T22:17:46.749+1100: 1289035.760: [GC [PSYoungGen: 4654489K->517299K(4673792K)] 9260416K->5123227K(15596480K), 0.4130828 secs] [Times: user=5.80 sys=0.00, real=0.41 secs]
    2013-03-19T22:21:08.762+1100: 1289237.763: [GC [PSYoungGen: 4673779K->522660K(4738880K)] 9279707K->5143831K(15661568K), 0.4005516 secs] [Times: user=5.97 sys=0.00, real=0.40 secs]
    2013-03-19T22:23:42.683+1100: 1289391.675: [GC [PSYoungGen: 4582628K->530998K(4590976K)] 9203799K->5186242K(15513664K), 0.4317352 secs] [Times: user=6.24 sys=0.00, real=0.43 secs]
    2013-03-19T22:26:11.096+1100: 1289540.080: [GC [PSYoungGen: 4590966K->518331K(4724096K)] 9246210K->5206959K(15646784K), 0.3914401 secs] [Times: user=5.99 sys=0.00, real=0.39 secs]
    2013-03-19T22:27:44.076+1100: 1289633.055: [GC [PSYoungGen: 2602730K->447527K(4732864K)] 7291358K->5208743K(15655552K), 0.3725317 secs] [Times: user=5.80 sys=0.00, real=0.37 secs]
    2013-03-19T22:27:44.448+1100: 1289633.428: [Full GC (System) [PSYoungGen: 447527K->0K(4732864K)] [ParOldGen: 4761215K->4628296K(10922688K)] 5208743K->4628296K(15655552K) [PSPermGen: 352378K->352287K(352832K)], 4.2955639 secs] [Times: user=57.70 sys=0.06, real=4.30 secs]
    2013-03-19T22:30:37.950+1100: 1289806.920: [GC [PSYoungGen: 4004416K->70948K(4690432K)] 8632712K->4699245K(15613120K), 0.1062227 secs] [Times: user=0.76 sys=0.00, real=0.11 secs]
    2013-03-19T22:33:27.154+1100: 1289976.115: [GC [PSYoungGen: 4054116K->109175K(4092352K)] 8682413K->4737472K(15015040K), 0.1347919 secs] [Times: user=1.03 sys=0.00, real=0.13 secs]
    2013-03-19T22:36:32.120+1100: 1290161.070: [GC [PSYoungGen: 4092343K->147318K(4712320K)] 8720640K->4775615K(15635008K), 0.1593523 secs] [Times: user=1.58 sys=0.00, real=0.16 secs]
When we are in failure mode, the GC logs look like this:
    2013-03-22T10:03:47.619+1100: 1504185.901: [GC [PSYoungGen: 0K->0K(5452736K)] 4413907K->4413907K(16375424K), 0.0114248 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
    2013-03-22T10:03:47.631+1100: 1504185.912: [Full GC [PSYoungGen: 0K->0K(5452736K)] [ParOldGen: 4413907K->4412613K(10922688K)] 4413907K->4412613K(16375424K) [PSPermGen: 358399K->358278K(358400K)], 5.4435442 secs] [Times: user=73.74 sys=0.14, real=5.44 secs]
    2013-03-22T10:03:53.145+1100: 1504191.426: [GC [PSYoungGen: 269219K->7734K(5449088K)] 4681833K->4422114K(16371776K), 0.0298728 secs] [Times: user=0.34 sys=0.00, real=0.03 secs]
    2013-03-22T10:03:53.175+1100: 1504191.456: [Full GC [PSYoungGen: 7734K->0K(5449088K)] [ParOldGen: 4414379K->4415189K(10922688K)] 4422114K->4415189K(16371776K) [PSPermGen: 358399K->358371K(358400K)], 2.6033684 secs] [Times: user=36.33 sys=0.00, real=2.60 secs]
    2013-03-22T10:03:55.788+1100: 1504194.069: [GC [PSYoungGen: 94969K->826K(5451328K)] 4510158K->4416015K(16374016K), 0.0133588 secs] [Times: user=0.16 sys=0.00, real=0.01 secs]
    2013-03-22T10:03:55.802+1100: 1504194.082: [Full GC [PSYoungGen: 826K->0K(5451328K)] [ParOldGen: 4415189K->4415348K(10922688K)] 4416015K->4415348K(16374016K) [PSPermGen: 358399K->358389K(358400K)], 2.7156884 secs] [Times: user=38.11 sys=0.00, real=2.71 secs]
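For what it's worth, here is my own back-of-the-envelope check that the failure-mode sample really is ~100% GC; the timestamps and pause times are copied straight from the six log lines above, nothing else is measured:

    // Rough GC-overhead calculation for the failure-mode sample above.
    public class GcOverhead {
        public static void main(String[] args) {
            double windowStart = 1504185.901;        // first event's JVM uptime (secs)
            double lastEventStart = 1504194.082;     // last event's JVM uptime (secs)
            double[] realPauses = {                  // full-precision pause times from the log
                0.0114248, 5.4435442, 0.0298728,
                2.6033684, 0.0133588, 2.7156884
            };

            double paused = 0;
            for (double p : realPauses) {
                paused += p;
            }
            double window = (lastEventStart + realPauses[realPauses.length - 1]) - windowStart;

            System.out.printf("%.2f s paused out of a %.2f s window (%.0f%% of wall time in GC)%n",
                    paused, window, 100 * paused / window);
            // Prints roughly: 10.82 s paused out of a 10.90 s window (99% of wall time in GC)
        }
    }

That matches what Process Explorer shows: the GC threads keeping every core busy while the application makes no progress, even though each Full GC reclaims almost nothing.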