Too much Full GC happening in CMS

Question

I'm observing frequent GC including Full GC in my java application under load off 100 clients making a web service call which return a JSON of 21KB (compressed size) to client. Uncompressed JSON will be around 200 KB. Beacause of frequent GC, I'm observing huge latency of 2.6 seconds. Here the JVM option used for the load test

-Xloggc:/mnt/apache-tomcat-7.0.29/logs/gc.log -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintTenuringDistribution -XX:+UseParNewGC -Xms5048M  -XX:MaxTenuringThreshold=2 -XX:SoftRefLRUPolicyMSPerMB=73 -Djava.net.preferIPv4Stack=true -server -XX:NewRatio=4 -XX:SurvivorRatio=8

I'm using AWS xLarge instance with 4vCPU and 15 MB RAM.

Here are the GCl logs produced by the process

 249.988: [GC 249.988: [ParNew
2163 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2164 - age   1:   83821968 bytes,   83821968 total
2165 - age   2:   20181408 bytes,  104003376 total
2166 : 896337K->103360K(930432K), 0.0409860 secs] 4453729K->3663260K(5065792K), 0.0412220 secs] [Times: user=0.16 sys=0.00, real=0.04 secs]

2167 250.051: [GC 250.051: [ParNew
2168 Desired survivor size 52920320 bytes, new threshold 2 (max 2)
2169 - age   1:   24997144 bytes,   24997144 total

2170 : 185613K->82956K(930432K), 0.0620620 secs] 3745513K->3744426K(5065792K), 0.0622810 secs] [Times: user=0.22 sys=0.00, real=0.06 secs]

2171 250.408: [GC 250.409: [ParNew
2172 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2173 - age   1:   86010184 bytes,   86010184 total
2174 - age   2:   19119656 bytes,  105129840 total


2183 : 930432K->103360K(930432K), 0.0904080 secs] 4767408K->4080180K(5065792K), 0.0906520 secs] [Times: user=0.33 sys=0.01, real=0.09 secs]
2184 251.229: [GC 251.229: [ParNew: 215470K->215470K(930432K), 0.0000370 secs] 4192291K->4192291K(5065792K), 0.0001620 secs] [Times: user=0.00 sys=0.00, real     =0.00 secs]

2185 GC locker: Trying a full collection because scavenge failed
2186 251.229: [Full GC 251.229: [CMS251.495: [CMS-concurrent-mark: 6.443/8.877 secs] [Times: user=30.14 sys=3.55, real=8.87 secs]
2187  (concurrent mode failure): 3976820K->354739K(4135360K), 0.9792960 secs] 4192291K->354739K(5065792K), [CMS Perm : 42722K->42722K(71212K)], 0.9794580 sec     s] [Times: user=0.96 sys=0.02, real=0.98 secs]
2188 252.439: [GC 252.440: [ParNew
2189 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2190 - age   1:  105807496 bytes,  105807496 total
2191 : 837144K->103360K(930432K), 0.0562310 secs] 1191884K->474282K(5065792K), 0.0564520 secs] [Times: user=0.22 sys=0.00, real=0.05 secs]
2192 252.676: [GC 252.676: [ParNew
2193 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2194 - age   1:   95836672 bytes,   95836672 total
2195 : 924377K->103360K(930432K), 0.0772210 secs] 1295299K->575312K(5065792K), 0.0774650 secs] [Times: user=0.28 sys=0.01, real=0.08 secs]
2196 253.001: [GC 253.001: [ParNew
2197 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2198 - age   1:   99903544 bytes,   99903544 total

2199 : 930432K->103360K(930432K), 0.0712130 secs] 1405673K->680963K(5065792K), 0.0714420 secs] [Times: user=0.25 sys=0.01, real=0.07 secs]

2200 253.324: [GC 253.324: [ParNew
2201 Desired survivor size 52920320 bytes, new threshold 1 (max 2)
2202 - age   1:  105814880 bytes,  105814880 total

2203 : 930432K->103360K(930432K), 0.0736730 secs] 1513770K->807662K(5065792K), 0.0739490 secs] [Times: user=0.27 sys=0.00, real=0.08 secs]
2204 253.666: [GC 253.667: [ParNew

score 3 · Answer 1 · edited May 23 '17 at 11:50

3

I wonder why the drive-bys are simply dinging this question without at least pausing to comment as to why they would do so... That bugs me.

Focusing on this message:

GC locker: Trying a full collection because scavenge failed

I found this thread:

https://blogs.oracle.com/poonam/entry/understanding_cms_gc_logs

With this information:

CMS can also be run in incremental mode (i-cms), enabled with -XX:+CMSIncrementalMode.

Are you running CMS in (i-cms) mode? It appears it could help to me, though I admit this isn't my expertise. Just stopped by to help.

Herein would be some further discussion of this issue and possible solutions:

https://forums.oracle.com/thread/1543499

Also, focusing on the message "concurrent mode failure", which is a major issue: How to reduce java concurrent mode failure and excessive gc

edited May 23 '17 at 11:50

Community

1
1

answered Aug 15 '13 at 11:30

Richard_G

4,700
3
42
78

For Nishi Kant and/or Mat: Did this help? Have you made progress? Do you have any feedback so that I could help further? – Richard_G Aug 16 '13 at 12:29
Not much progess, i did try lots of switch to do away with "Concurrent mode failure" issue but no success. I'm observing same message event if load is reduced to 30 users. – Nishi Kant Aug 19 '13 at 14:29
It looks like incremental CMS will soon be deprecated. Shame because it did fix this very issue for me. "Java HotSpot(TM) Client VM warning: Using incremental CMS is deprecated and will likely be removed in a future release" – eric_the_animal Apr 01 '16 at 00:47

score 2 · Answer 2 · answered Feb 15 '16 at 14:57

Let's summarize your memory settings:

I suppose that the mentioned 15 MB are a typo. Assuming that your machine has less than 19.7 GB of RAM, the option "-Xms5048M" effectively sets minimum and maximum heap size to 5048 MB (the default max is 25% of machine RAM, so the max will be higher if you have >19.7 GB RAM).

Together with options -XX:NewRatio=4 and -XX:SurvivorRatio=8, what you get is the following memory layout:

Total heap size: 5048 MB; according to new ratio, this consists of
- 80%, i.e., 4038.4 MB tenured (old gen) space
- 20%, i.e., 1009.6 MB young gen space; according to the selected survivor ratio, that will consist of
  - 80%, i.e. 807.68 MB Eden space and
  - 10%, i.e. 101.96 MB S0 (Survivor) space
  - 10%, i.e. 101.96 MB S1 (Survivor) space

Now let's check what the JVM makes out of that:

From your logs, you can see that the survivor space is very small; e.g., on line 2174, the sum of age 1 and 2 entries is 182 MB, thus exceeding the S0/S1 size by far. The low max age (1 or 2 generations) is also a sign of survivor space being too limited -- JVM will reduce the max age as it tries to trim survivor space utilization to the target value (50% by default).

The report line on line 2183 indicates that 136.5 MB were promoted from young to tenured space (young gen diff after GC is 930432 KB - 103360 KB = 827072 KB, total heap diff is 4767408 KB - 4080180 KB = 687228 KB, thus 139844 KB promoted).

On the same line 2183, you can see that the old gen usage is at 4080180 KB - 103360 KB = 3976820 KB = 3883.6 MB. This leaves only 154.8 MB to the max old gen size.

I'm not into the details of the GC algorithm triggers (and from what I learned, they do change between JVM versions), but considering

the size of the last promotion (136.5 MB) and
the amount of old gen space that's left for the next promotion (154.8 MB)

...it's clear that a Full GC is urgently needed.

As a solution, I'd try increasing the young gen size

this will relax the extremely tight S0/S1 setup
more objects will have a chance to die young, so that
the promotion rate to tenured space will reduce, and
CMS concurrent mode has enough time to complete before the "Full GC" emergency brake must be triggered.

Of course, increasing the total heap size will also help, as may further tuning of the SurvivorRatio value; especially the latter is highly application-dependent, though, and I wouldn't touch it without a deep understanding of both the application and the GC algorithm.

Too much Full GC happening in CMS

2 Answers2