Java CMS GC Behaviours

Question

I have an application that creates a lot of garbage. My first (and almost only) criterion is low GC pause time. I tried different GC parameters, comparing them with the visualgc tool and GC logs. The best parameters so far are below.

-XX:+UseConcMarkSweepGC

-Xmx1172M

-Xms600M

-XX:+UseParNewGC

-XX:NewSize=150M

My application runs on SunOS 10 with Java 1.6.0_21. The hardware is 2 x quad-core CPUs (uname -X reports numCPU = 8).

My questions are:

From observing GC behaviour: new objects are created in eden until eden is full. When eden is full, the GC runs and clears the garbage; objects that are still alive are copied to the old generation (I am ignoring the 'from' and 'to' spaces). Similarly, when the old generation is full, the GC runs the CMS concurrent phases and clears the old generation. Some parts of CMS are stop-the-world (pause time). This repeats in a loop.

Is the above scenario correct?
After the GC cleans the old generation, if there is still not enough space, does the old generation expand (my -Xms and -Xmx values are different)?
When does a full GC start? How is that decided?
The duration of the CMS concurrent phase depends on the eden size. I expected that eden size would not affect the CMS concurrent phase duration. What does the GC do with eden during the CMS concurrent phase?
What else would you suggest to minimize pause time? This is the most valuable answer for me :)

Thanks

score 10 · Accepted Answer · edited Jul 30 '12 at 01:01

You can't just ignore the survivor spaces when using CMS. CMS is not a compacting collector, which means that if you (or the JVM) get the tenuring threshold wrong, you will slowly bleed objects into tenured. That increases the rate at which tenured fragments, which brings forward the point at which there is insufficient contiguous free space to handle promotions from the survivor spaces into tenured. When that happens, a full GC cycle is forced with no advance warning, so the whole collection happens in one STW pause. How long this takes depends on the size of your heap, but one thing is highly likely: it will be orders of magnitude longer than a normal eden collection.

There are a few other things to note here:

STW pauses do not only come from CMS, they come from the young gen collector too
CMS has 2 STW phases (initial mark and remark) and 3-4 concurrent phases. The first STW phase (initial mark) is strictly single-threaded, which can cause issues (sample discussion on this here)
You can control the number of threads handling the concurrent phases
You need to understand how long objects tend to live for. This may mean using -XX:+PrintTenuringDistribution, or you can just watch it with visualgc like you've done
You can then tune this with -XX:SurvivorRatio to control the size of the survivor spaces relative to eden, and -XX:MaxTenuringThreshold to control how many young collections an object can survive before it is tenured
-XX:CMSInitiatingOccupancyFraction can be used to guide CMS as to how full tenured needs to be before a CMS cycle starts (get this wrong and you'll pause badly)
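To make the -XX:SurvivorRatio point concrete, here is a rough sketch of the arithmetic. It assumes the usual definition (SurvivorRatio is the ratio of eden to one survivor space, and the young generation is eden plus two survivor spaces) and uses the 150M young generation from the question. The class and method names are made up for illustration, and the real JVM aligns and rounds these sizes, so treat the numbers as approximate.

```java
public class YoungGenSizing {

    /** Approximate size of ONE survivor space, in MB. */
    static double survivorMb(double youngMb, int survivorRatio) {
        // young = eden + 2 * survivor, and eden = survivorRatio * survivor,
        // so young = (survivorRatio + 2) * survivor.
        return youngMb / (survivorRatio + 2);
    }

    /** Approximate size of eden, in MB: whatever the two survivor spaces leave over. */
    static double edenMb(double youngMb, int survivorRatio) {
        return youngMb - 2 * survivorMb(youngMb, survivorRatio);
    }

    public static void main(String[] args) {
        double youngMb = 150; // -XX:NewSize=150M, taken from the question
        // The ratios below are arbitrary illustration values, not recommendations.
        for (int ratio : new int[] {4, 8, 13}) {
            System.out.printf("SurvivorRatio=%d: eden=%.1fM, each survivor=%.1fM%n",
                    ratio, edenMb(youngMb, ratio), survivorMb(youngMb, ratio));
        }
    }
}
```

A smaller ratio gives larger survivor spaces, so objects have more room to age in the young generation before being promoted, at the cost of a smaller eden and more frequent young collections.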

Ultimately you need to understand which collector is pausing, how often, for how long and whether there are any abnormal causes of that pause. You then need to compare this against the size of each generation to see whether you can tweak the parameters so as to minimise the number (and/or the duration) of pauses.
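As a first rough cut at "which collector, how often, for how long", the standard java.lang.management API exposes per-collector counters. The sketch below (class name is my own) just prints them. Note the assumption: these are cumulative totals and do not separate STW time from concurrent work, so the GC log remains the place to look at individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {

    /** Builds one line per collector with its cumulative count and elapsed time. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if the collector does not report it
            long millis = gc.getCollectionTime();  // approximate accumulated time, -1 if undefined
            sb.append(String.format("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), count, millis));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Collector names depend on the GC flags the JVM was started with.
        System.out.print(report());
    }
}
```

Calling report() periodically during a long-running test and diffing the counters gives a cheap trend line to set alongside the GC logs.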

Bear in mind that this can be a time sink, because you need long-running tests to see whether behaviour deteriorates over time. Also, without a repeatable, automated workload, it's nigh on impossible to draw any firm conclusions as to whether you've actually improved things.

One good source of summary info on the internals is Jon Masamitsu's blog. Another good presentation on this is GC Tuning in the HotSpot Java VM.

After 20 hours, the GC logs show about 5 full GCs. The clues as to why the full GCs ran seem to be "promotion failure" and "concurrent mode failure"; I searched Google for these reasons. In short: increase the old generation size for "promotion failure", and set a low -XX:CMSInitiatingOccupancyFraction value for "concurrent mode failure". I will try setting -XX:CMSInitiatingOccupancyFraction to a small value (like 30 or 60) and increasing the heap. I will share the test results. — Erdinç Taşkın, Mar 17 '11 at 09:59
Promotion failure is usually the fragmentation issue I mentioned, which forces a non-concurrent full GC. You need to examine your tenuring threshold and size the survivor spaces appropriately. Setting initiating occupancy to a low value (default is 70 iirc) will just mean more frequent CMS cycles that don't do much, which is not good. Do you even have much that lives for a long time? You may find a massive eden and a tiny tenured is a good option. — Matt, Mar 17 '11 at 10:09
A low initiating occupancy value means more frequent CMS cycles, but that is not a problem. The biggest problem is STW pauses of 2-3 seconds. Throughput, or STW pauses of 0.0x seconds, are not a problem in my case. I tried a big eden size, but the STW duration increased :( How do I set the number of threads for the concurrent phases? — Erdinç Taşkın, Mar 17 '11 at 12:03
Can you print the GC log output for that pause? It would be interesting to see what it spent the time doing. Use `ParallelCMSThreads` to set the number of threads; explanation [here](http://blogs.sun.com/jonthecollector/entry/did_you_know) — Matt, Mar 17 '11 at 16:01

score 2 · Answer 2 · edited Mar 16 '11 at 16:35

The best way to minimise the impact of GC is to minimise the number of objects you create. This is not always easy to do, or the best solution overall, but it will minimise the GC pauses.

If you can't produce fewer objects, try to make them short-lived enough, and the eden space large enough, that they never leave the eden space. (Or make them very long-lived and reused.)
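As a small illustration of the "long-lived and reused" option, here is a sketch that keeps one StringBuilder alive and resets it on each call instead of allocating a fresh one. The class and method names are hypothetical, and this is only one of many ways to cut allocation. Two caveats: toString() still allocates a new String per call, and a shared builder like this is not thread-safe.

```java
public class BufferReuse {

    /** Baseline: allocates a new StringBuilder (and its backing array) on every call. */
    static String formatFresh(int id, String name) {
        return new StringBuilder().append(id).append(':').append(name).toString();
    }

    // One long-lived builder, reused across calls. Not safe to share between threads.
    private final StringBuilder buf = new StringBuilder(64);

    /** Same result as formatFresh, but reuses the builder's backing array. */
    String formatReused(int id, String name) {
        buf.setLength(0); // reset the content, keep the allocated capacity
        return buf.append(id).append(':').append(name).toString();
    }

    public static void main(String[] args) {
        BufferReuse r = new BufferReuse();
        System.out.println(formatFresh(7, "a"));
        System.out.println(r.formatReused(7, "a"));
    }
}
```

Whether this helps in practice depends on the allocation profile, so it is worth measuring before and after with the same workload.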

There are three spaces to worry about here: eden -> survivor -> tenured. See http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
The GC tries to ensure there is enough free space after a full GC, and the -Xms and -Xmx options (formerly known as -ms and -mx) control how large the heap can be.
A full GC starts when the tenured space is full, or the survivor space is exhausted (e.g. too many objects are copied from the eden space), or CMS decides now is a good time to try a concurrent cleanup.
The CMS only cleans up the tenured space.
See my previous answers.
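To see the eden, survivor and tenured spaces mentioned above from inside the application, the standard MemoryPoolMXBean API can list the heap pools. This is a minimal sketch with a class name of my own choosing. Pool names vary with the collector in use, so it prints whatever names the running JVM reports rather than assuming them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class HeapPools {

    /** Builds one line per heap memory pool with its current usage in KB. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage u = pool.getUsage();
            if (u == null) {
                continue; // the pool is no longer valid
            }
            // getMax() is -1 when the maximum is undefined.
            long maxKb = u.getMax() < 0 ? -1 : u.getMax() / 1024;
            sb.append(String.format("%s: used=%dK committed=%dK max=%dK%n",
                    pool.getName(), u.getUsed() / 1024, u.getCommitted() / 1024, maxKb));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```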

I agree with you about the decision to increase the eden space. I already tried different NewSize values and checked the pause time in the GC log lines that include "Rescan". Smaller NewSize values give shorter pause times. Three different NewSize values were consistent with this inference. — Erdinç Taşkın, Mar 16 '11 at 09:17
