
We have started having some serious problems with our Java EE application. Specifically, the application fills up to 99% of the old generation heap within minutes after start-up. No OOMs are thrown, but the JVM is effectively unresponsive. jstat shows that the old generation does not decrease in size at all, no garbage collection is going on, and kill -3 says:

Heap
 PSYoungGen      total 682688K, used 506415K [0xc1840000, 0xf3840000, 0xf3840000)
  eden space 546176K, 92% used [0xc1840000,0xe06cd020,0xe2da0000)
  from space 136512K, 0% used [0xe2da0000,0xe2da0000,0xeb2f0000)
  to   space 136512K, 0% used [0xeb2f0000,0xeb2f0000,0xf3840000)
 PSOldGen        total 1536000K, used 1535999K [0x63c40000, 0xc1840000, 0xc1840000)
  object space 1536000K, 99% used [0x63c40000,0xc183fff8,0xc1840000)
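
(For the record, I was watching the old generation with something along these lines; the PID is a placeholder, and the OU column is the old-gen usage in KB:)

$ jstat -gc <pid> 1000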

The VM options are:

-Xmx2300m -Xms2300m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:SurvivorRatio=4 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParallelGC -XX:ParallelGCThreads=4 

(I changed this from a 2300m heap with an 1800m new generation, in an attempt to resolve the problem.)

I took a heap dump of the JVM once it reached this "out of memory" state (it took forever) and ran Eclipse Memory Analyzer on it.
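
(The dump itself was taken with something along these lines; jmap needed -F, and the PID and output file are placeholders:)

$ jmap -F -dump:format=b,file=heap.bin <pid>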

The results are quite curious. About 200MB is occupied by objects of all kinds (some own more than others), but the rest, 1.9GB, is all unreachable (it may be noteworthy that the majority of it is occupied by GSON objects, but I don't think that is an indication of anything by itself; it only says that we churn through a lot of GSON objects during server operation).

Is there any explanation as to why the VM has so many unreachable objects, and is incapable of collecting them at all?

JVM:

$ /0/bin/java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) Server VM (build 20.12-b01, mixed mode)

When the system gets to this stall, here is what verbose GC keeps printing out:

922.485: [GC [1 CMS-initial-mark: 511999K(512000K)] 1952308K(2048000K), 3.9069700 secs] [Times: user=3.91 sys=0.00, real=3.91 secs] 
926.392: [CMS-concurrent-mark-start]
927.401: [Full GC 927.401: [CMS927.779: [CMS-concurrent-mark: 1.215/1.386 secs] [Times: user=5.84 sys=0.13, real=1.38 secs] (concurrent mode failure): 511999K->511999K(512000K), 9.4827600 secs] 2047999K->1957315K(2048000K), [CMS Perm : 115315K->115301K(262144K)], 9.4829860 secs] [Times: user=9.78 sys=0.01, real=9.49 secs] 
937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs] 
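
(For reference, logging in this format is produced by GC flags along these lines; the exact set and log path are a guess, not necessarily what ran on that node:)

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log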

SOLVED

As Paul Bellora suggested, this was caused by too many objects being created within the JVM in too short a period of time. Debugging becomes quite tedious at this point. What I ended up doing is instrumenting the classes using a custom JVM agent. The instrumentation counts method and constructor invocations; the counts were then examined. I found that a single inconspicuous operation would create about 2 million objects and trigger certain individual methods about 1.5 million times (no, there were no loops). The operation itself was identified by being slow compared to the others. You can use any hotspot profiler as well (something like VisualVM), but I had all kinds of trouble with those, so I ended up writing my own.

I still think the behavior of the JVM is a mystery. It looks like the garbage collector comes to a stall and will not clean any more memory, yet the memory allocator expects it to (and thus no OOMs are thrown). I would have expected it to clear out all that unreachable memory instead. But the application behavior wouldn't be much better off either way, as the majority of the time would have been spent garbage collecting anyway.

The agent that I used for help can be found here: https://github.com/veselov/MethodCountAgent. It's far from being a polished piece of software.
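
For illustration only, here is a minimal sketch of what such a counting agent can look like. This is not the code of the linked project; the package and class names, the "com/myapp/" filter and the use of Javassist are assumptions, and the agent jar's manifest would still need a Premain-Class entry (plus Boot-Class-Path so that instrumented classes can see the agent class).

package example; // hypothetical package

import java.io.ByteArrayInputStream;
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtConstructor;

public class CountingAgent {

    // Constructor invocation counts, keyed by class name.
    public static final ConcurrentHashMap<String, AtomicLong> COUNTS =
            new ConcurrentHashMap<String, AtomicLong>();

    // Called from the code injected into each constructor.
    public static void count(String className) {
        AtomicLong c = COUNTS.get(className);
        if (c == null) {
            AtomicLong fresh = new AtomicLong();
            AtomicLong prev = COUNTS.putIfAbsent(className, fresh);
            c = (prev == null) ? fresh : prev;
        }
        c.incrementAndGet();
    }

    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            public byte[] transform(ClassLoader loader, String name, Class<?> cls,
                                    ProtectionDomain pd, byte[] bytes) {
                // Only touch application classes; "com/myapp/" is a placeholder.
                if (name == null || !name.startsWith("com/myapp/")) {
                    return null;
                }
                try {
                    CtClass ct = ClassPool.getDefault()
                            .makeClass(new ByteArrayInputStream(bytes));
                    for (CtConstructor ctor : ct.getDeclaredConstructors()) {
                        ctor.insertAfter("example.CountingAgent.count(\""
                                + ct.getName() + "\");");
                    }
                    byte[] out = ct.toBytecode();
                    ct.detach(); // don't let the ClassPool grow without bound
                    return out;
                } catch (Throwable t) {
                    return null; // on any failure, leave the class unmodified
                }
            }
        });

        // Dump the counters when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                for (Map.Entry<String, AtomicLong> e : COUNTS.entrySet()) {
                    System.err.println(e.getKey() + " = " + e.getValue());
                }
            }
        });
    }
}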

Pawel Veselov
  • What does the GC log show? How are the minor collections doing? Any Full GCs? Also, can you try to collect a few histograms? – Aleš Jan 17 '13 at 01:31
  • @AlesJ. It wasn't running with verbose gc, but jstat showed quick newgen churn and oldgen growth, until old gen saturated. – Pawel Veselov Jan 17 '13 at 01:46
  • But then I am not sure if you can say that `there is no garbage collection going on` when there is no gc log showing it. Also, the MAT shows only 200Mb but it does not mean that the rest (1.9GB) is unreachable... There was probably a FullGC cycle during the Heap Dump. A histogram snapshot will be more precise. – Aleš Jan 17 '13 at 01:54
  • The GC log will show you the promotion rate and the live data set size; then you will see whether there is a memory leak or whether you need to change the sizes of your generations. See here for how to analyze GC issues: http://stackoverflow.com/q/6871213/772000 – Aleš Jan 17 '13 at 01:56
  • Definitely turn on GC logging. Also for kicks, try using `-XX:+UseConcMarkSweepGC` instead of `-XX:+UseParallelGC`. – cambecc Jan 17 '13 at 01:59
  • @AlesJ. Maybe GC is in progress but takes forever (though would the gc log show it?). I said that no GC is going on because jstat doesn't show a single byte drop in OU. I also know that the rest is unreachable because MAT actually shows me the objects that are unreachable, and those total 1.9G. Taking the dump (jmap) took a good 10 min, and I had to use -F – Pawel Veselov Jan 17 '13 at 02:13
  • @cambecc I did change the GC to -XX:+UseConcMarkSweepGC -XX:+UseParNewGC, and set GC to verbose on one of the nodes. Considering these are production systems, kicks are not encouraging :) But I did find some references saying that -XX:+UseParallelGC is bad for smaller heaps – Pawel Veselov Jan 17 '13 at 02:21
  • This reminds me of an issue my team had to solve some time ago. We analyzed heap dumps but found no traditional memory leak. The only clue was that the vast majority of objects were unreachable. The culprit turned out to be twitter4j - paginated calls to the twitter API continued to return the same page of data, indefinitely spamming new json objects. I wonder if something similar could be happening here. – Paul Bellora Jan 17 '13 at 02:29
  • @PaulBellora That sounds very similar. At the top of my unreachable list there is com.google.gson.internal.StringMap$LinkedEntry with 25M entries, taking up 804M of heap. But it's the nature of my app to process json requests. My concern is GC not being able to clean those away fast enough, and I believe it should. – Pawel Veselov Jan 17 '13 at 02:42
  • @PawelVeselov I guess the question is whether you should have to expect that level of throughput. Try to isolate the spammy responses and measure/debug to determine whether it's a true need or an errored state like in our case. Then you'll know whether it's a bug or scalability issue. – Paul Bellora Jan 17 '13 at 02:56
  • Did you perhaps think that your application has poor design or coding problems, rather than blaming the VM? Not saying it does, but in my (limited) experience, a lot of times I want to blame something other than myself when it's really all my fault... :) – KyleM Jan 17 '13 at 22:50
  • @KyleM I have not been blaming the VM, at least I didn't notice that I was. I'm trying to figure out where to even start digging. – Pawel Veselov Jan 17 '13 at 23:08
  • Did you happen to publish the agent you created on GitHub? Any other references for the same? – Sachin Sharma Dec 23 '19 at 12:03
  • @SachinSharma I've published the agent (mind you, I haven't used it since then), and added the link to the question text above. – Pawel Veselov Dec 24 '19 at 23:47

2 Answers


Is there any explanation as to why the VM has so many unreachable objects, and is incapable of collecting them at all?

(Based on our exchange in the comments) it sounds like this is not a traditional memory leak but some piece of logic that is continuously spamming new objects such that the GC struggles to keep up under the current architecture.

The culprit could be, for example, some API request that is made many, many times, or one that is "stuck" in an erroneous state like the infinite pagination scenario I described. Either situation boils down to millions of response gson objects (which point to Strings, which point to char[]s) being instantiated and then becoming eligible for GC.

As I said, you should try to isolate the problem request(s), then debug and take measurements to decide whether this is a bug or a scalability issue on the part of your application or one of its libraries.
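
To make the shape of the problem concrete, here is a contrived sketch (hypothetical code, not taken from your application; the PagedApi interface and fetchPage method are made up) of a pagination loop whose cursor never advances, so every iteration deserializes another full page into short-lived gson objects:

import java.util.List;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public class PaginationSpam {

    // Hypothetical paginated API client (stands in for twitter4j, etc.).
    interface PagedApi {
        String fetchPage(int page); // returns a JSON array for the given page
    }

    private static final Gson GSON = new Gson();

    static void consumeAllPages(PagedApi api) {
        int page = 1;
        while (true) {
            String json = api.fetchPage(page);
            List<Map<String, Object>> items = GSON.fromJson(json,
                    new TypeToken<List<Map<String, Object>>>() {}.getType());
            if (items == null || items.isEmpty()) {
                break;
            }
            handle(items);
            // BUG: 'page' is never incremented. If the API keeps returning the
            // same non-empty page, every iteration allocates a fresh tree of
            // gson maps and strings that immediately becomes unreachable.
        }
    }

    static void handle(List<Map<String, Object>> items) {
        // pretend to do real work with the page here
    }
}

Nothing here is retained, so it isn't a leak in the traditional sense - the allocation rate alone can bury the collector.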

Paul Bellora
  • I would like to see some of the source code, especially loops – TecHunter Jan 17 '13 at 05:55
  • In my case, a set of APIs is getting called many, many times and I am getting an OOM error after some time. Upon debugging the heap dump using MAT, it seems it is due to `com.fasterxml.jackson.databind.deser.Deserializers[]`, as its objects occupy most of the `Shallow heap`. Any idea what could be done here to resolve this (besides reducing the traffic / scaling up the server)? Is there a way we can make sure that such objects are deleted from memory so that such an issue does not occur? – user Nov 04 '20 at 18:47
  • @user I think the best way to get help with that specific issue will be to open a new question – Paul Bellora Nov 05 '20 at 20:00

Based on the stats you listed, I find it hard to believe you have 1.9G of unreachable data. It looks more like the GC Overhead Limit being reached.

Consider

937.746: [Full GC 937.746: [CMS: 512000K->511999K(512000K), 8.8891390 secs] 2047999K->1962252K(2048000K), [CMS Perm : 115302K->115302K(262144K)], 8.8893810 secs] [Times: user=8.89 sys=0.01, real=8.89 secs]

If this is true, then a Full GC released only about 85MB of data (2047999K -> 1962252K, roughly 85,700K). If you really had 1.9G of unreachable data, you would expect to see something like 2047999K -> ~200000K (the roughly 200MB that MAT reports as reachable).

Also

object space 1536000K, 99% 

Implies something was created and stored in such a way that it escaped its method and is now, most likely, living forever.

I would need to see more evidence that you have 1.9G of unreachable data other than simply being told so.

John Vint
  • I have a 2.5Gb heap dump file that says so :) Is there any analysis I could do on that heap dump that would be reasonable? I'm as puzzled about this as you are, but MAT says exactly that - 1.9Gb of unreachable objects - and GC says what it's saying... – Pawel Veselov Jan 17 '13 at 20:31
  • I do believe you have a 2.5G heap dump, but based on the two GC logs, at that moment everything was reachable. Do you see *GC Overhead Limit reached* in your output? When you load the heap dump through Eclipse (default parsing), what numbers are shown to you at first? (Maybe a screenshot would help.) – John Vint Jan 17 '13 at 22:33
  • I added the screenshots (I hope these are the ones you meant). I don't see "Overhead Limit reached" in the GC output. – Pawel Veselov Jan 17 '13 at 22:45
  • And at no point does it resolve itself? As you have already said, everything looks fine. The heap dump was taken while it still had the 2G old-gen space filled? – John Vint Jan 17 '13 at 23:13
  • The heap dump was taken when the VM was at that stall, with the old gen filled up. I never saw it resolve itself, but we never had a chance to wait for the load to drop enough to test that. The heap dumping itself took a good 15 minutes, and had to be done with jmap -F. – Pawel Veselov Jan 17 '13 at 23:17