
Occasionally, somewhere between once every 2 days and once every 2 weeks, my application crashes in a seemingly random location in the code with: java.lang.OutOfMemoryError: GC overhead limit exceeded. If I google this error I come to this SO question, which leads me to this piece of Sun documentation, which explains:

The parallel collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.

Which tells me that my application is apparently spending 98% of the total time in garbage collection to recover only 2% of the heap.

But 98% of what time? 98% of the entire two weeks the application has been running? 98% of the last millisecond?

I'm trying to determine a best approach to actually solving this issue rather than just using -XX:-UseGCOverheadLimit but I feel a need to better understand the issue I'm solving.
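For anyone who wants to see the failure mode in isolation: a toy program along these lines (hypothetical, not my actual application) fills the heap with small long-lived objects, so that successive full GCs recover almost nothing while the program keeps allocating. Run it with a small heap and the parallel collector, e.g. java -Xmx64m -XX:+UseParallelGC OverheadDemo; depending on timing you may get this exact error or a plain "Java heap space" error.

    import java.util.HashMap;
    import java.util.Map;

    public class OverheadDemo {
        public static void main(String[] args) {
            // Every entry stays strongly reachable, so each full GC frees
            // almost nothing while the loop keeps allocating boxed Longs.
            Map<Long, Long> retained = new HashMap<Long, Long>();
            for (long i = 0; ; i++) {
                retained.put(i, i);
            }
        }
    }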

jilles de wit
    From the docs, it seems to be 98% of the entire 2 weeks. Have you enabled GC logs with these flags -verbose:gc -XX:+PrintGCDetails XX:+PrintGCTimeStamps –Xloggc:PATH_FROM_ROOT/gclog.log. Would be good to see the App running time and stopped time due to GC. – JoseK May 19 '10 at 12:28
  • GC logging is a nice suggestion I'll try that. 98% of 2 weeks seems unlikely but you are right, that is what the docs imply. I hope it is just imprecise writing – jilles de wit May 20 '10 at 10:19
  • Did you find out the meaning of 98% of time? My view is that GC should be busy taking up 98% of application utilization at the very moment the exception occurs and not for the 2 weeks. – Monis Iqbal Aug 25 '10 at 10:01
  • @Monis: I haven't found it out, and gave up looking. 98% of the time "at the very moment" doesn't make much sense because a moment is by definition not a period of time so "98% of a moment" can't be either (and is just as "long" as 2% of a moment). – jilles de wit Sep 06 '10 at 13:19
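For reference, JoseK's flags assembled into a complete launch command might look like this sketch (YourApp is a placeholder, and PATH_FROM_ROOT is kept as the commenter's own placeholder):

    java -verbose:gc \
         -XX:+PrintGCDetails \
         -XX:+PrintGCTimeStamps \
         -Xloggc:PATH_FROM_ROOT/gclog.log \
         YourApp

The resulting log has a timestamped line per collection, from which time spent in GC versus total running time can be read off.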

3 Answers


I'm trying to determine a best approach to actually solving this issue rather than just using -XX:-UseGCOverheadLimit but I feel a need to better understand the issue I'm solving.

Well, you're using too much memory, and from the sound of it, it's probably because of a slow memory leak.

You can try increasing the heap size with -Xmx, which would help if this isn't a memory leak but a sign that your app actually needs a lot of heap and the setting you currently have is slightly too low. If it is a memory leak, this'll just postpone the inevitable.
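For example (a sketch; the 2g value and YourApp are placeholders to be sized for your application):

    java -Xmx2g YourApp

Setting -Xms to the same value additionally avoids heap-resizing pauses, though that is unrelated to the overhead limit itself.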

To investigate if it is a memory leak, instruct the VM to dump heap on OOM using the -XX:+HeapDumpOnOutOfMemoryError switch, and then analyze the heap dump to see if there are more objects of some kind than there should be. http://blogs.oracle.com/alanb/entry/heap_dumps_are_back_with is a pretty good place to start.
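Concretely, the launch and the analysis might look like this sketch (paths and the pid in the dump file name are placeholders; jhat ships with JDK 6, and Eclipse MAT is an alternative):

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/some/dump/dir \
         YourApp

    # after the crash, serve the dump and browse it at http://localhost:7000
    jhat /some/dump/dir/java_pid1234.hprof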


Edit: As fate would have it, I happened to run into this problem myself just a day after this question was asked, in a batch-style app. This wasn't caused by a memory leak, and increasing the heap size didn't help either. What did help was decreasing the heap size (from 1 GB to 256 MB), which made full GCs faster (though somewhat more frequent). YMMV, but it's worth a shot.

Edit 2: The smaller heap didn't solve all the problems, so the next step was enabling the G1 garbage collector, which seems to do a better job than CMS.
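For the record, on the late JDK 6 builds where G1 first shipped it was still experimental, so enabling it looked roughly like this (on JDK 7 and later, -XX:+UseG1GC alone suffices; YourApp is a placeholder):

    java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC YourApp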

gustafc
  • I'm trying some profiling, and I'll try that one as well. Thanks. – jilles de wit May 20 '10 at 10:23
  • I went a similar route as you did, experimenting with parameters. Eventually, increasing the heap size and some tweaking of my code (I have found no memory leaks though) seems to have solved my problem. – jilles de wit Sep 06 '10 at 13:22
  • why would you use CMS or G1 in a batch-style app? Isn't throughput collector better? – endless Mar 12 '13 at 03:07

But 98% of what time? 98% of the entire two weeks the application has been running? 98% of the last millisecond?

The simple answer is that it is not specified. However, in practice the heuristic "works", so it cannot be either of the two extreme interpretations that you posited.

If you really wanted to find out the interval over which the measurements are made, you could always read the OpenJDK 6 or 7 source code. But I wouldn't bother, because it wouldn't help you solve your problem.

The "best" approach is to do some reading on tuning (starting with the Oracle / Sun pages), and then carefully "twiddle the tuning knobs". It is not very scientific, but the problem space (accurately predicting application + GC performance) is "too hard" given the tools that are currently available.

Stephen C

The >98% would be measured over the same period in which less than 2% of memory is recovered.

It's quite possible that there is no fixed period for this: the OOM check might, for instance, be done after every 1,000,000 object liveness checks, and the time that takes would be machine-dependent.

You most likely can't "solve" your problem by adding -XX:-UseGCOverheadLimit. The probable result is that your application will slow to a crawl, use a bit more memory, and then hit the point where the GC simply doesn't recover any memory at all. Instead, fix your memory leaks and then (if still needed) increase your heap size.
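To illustrate the kind of slow leak being described (hypothetical code, not from the question): a collection that only ever grows keeps every entry reachable, so the collector works harder and harder while freeing less and less.

    import java.util.ArrayList;
    import java.util.List;

    public class LeakyService {
        // Grows on every call and is never pruned, so every entry stays
        // strongly reachable forever: a classic slow memory leak.
        private static final List<String> handledRequests = new ArrayList<String>();

        public static void handle(String request) {
            handledRequests.add(request); // leak: should be bounded or cleared
            // ... actual request handling ...
        }
    }

In a heap dump taken with -XX:+HeapDumpOnOutOfMemoryError, a structure like handledRequests would stand out by its retained size.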

MSalters