
I am working with a team developing a Java GUI application running on a 1GB Linux target system.

We have a problem where the memory used by our java process grows indefinitely, until Linux finally kills the java process.

Our heap memory is healthy and stable (we have profiled our heap extensively). We also used MemoryMXBean to monitor the application's non heap memory usage, since we believed the problem might lie there. However, what we see is that the reported heap size + reported non heap size stays stable.
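For reference, a minimal sketch of this kind of MemoryMXBean polling (the class name and polling interval are illustrative, not our actual monitoring code):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryMonitor {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = memoryBean.getHeapMemoryUsage();
            MemoryUsage nonHeap = memoryBean.getNonHeapMemoryUsage();
            // "committed" is the memory the JVM has actually reserved for these pools
            System.out.printf("heap committed=%d MB, non heap committed=%d MB%n",
                    heap.getCommitted() / (1024 * 1024),
                    nonHeap.getCommitted() / (1024 * 1024));
            Thread.sleep(60_000); // poll once a minute
        }
    }
}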

Here is an example of how the numbers might look when running the application on our target system with 1GB RAM (heap and non heap reported by MemoryMXBean, total memory used by the Java process monitored using Linux's top command (resident memory)):

At startup:

  • 200 MB heap committed
  • 40 MB non heap committed
  • 320 MB used by java process

After 1 day:

  • 200 MB heap committed
  • 40 MB non heap committed
  • 360 MB used by java process

After 2 days:

  • 200 MB heap committed
  • 40 MB non heap committed
  • 400 MB used by java process

The numbers above are just a "cleaner" representation of how our system performs, but they are fairly accurate and close to reality. As you can see, the trend is clear. After a couple of weeks running the application, the Linux system starts having problems due to running out of system memory. Things start slowing down. After a few more hours the Java process is killed.

After months of profiling and trying to make sense of this, we are still at a loss. I feel it is hard to find information about this problem, as most discussions end up explaining the heap or the other non heap memory pools (like Metaspace etc.).

My questions are as follows:

  1. If you break it down, what does the memory used by a Java process include (in addition to the heap and non heap memory pools)?

  2. Which other potential sources are there for memory leaks? (native code? JVM overhead?) Which ones are, in general, the most likely culprits?

  3. How can one monitor / profile this memory? Everything outside the heap + non heap is currently somewhat of a black box for us.

Any help would be greatly appreciated.

Serenic
  • One thing I noticed about GUIs: at least some implementations allocate memory for the graphics directly. That is, if you tell it to draw a large area, even if only a tiny bit of it is visible, it will directly allocate memory for the entire drawing, and that may OOM you. – RealSkeptic Aug 24 '16 at 08:13
  • That's an interesting observation, RealSkeptic. I doubt that's what happens in our case, though, since this is something which builds up slowly over days/weeks. – Serenic Aug 24 '16 at 08:14
  • can't say for sure, but it's worth researching, because I'm not sure under which circumstances the graphics memory is released. Even if you're always allocating small areas, if they stay in use somehow, it will cause a memory leak. – RealSkeptic Aug 24 '16 at 08:16
  • Sounds like a memory leak in native code - but then I don't know if you're using any. – piet.t Aug 24 '16 at 08:18
  • Is there a way to run the activities spread over multiple days in a small timeframe? An easily reproducible problem will be of great use in debugging. – Ashwinee K Jha Aug 24 '16 at 08:31
  • Piet: A memory leak in native code is quite possible. We are currently theorizing that the problem is most likely caused by one of our dependencies. How do you profile memory used by native code, though? – Serenic Aug 24 '16 at 08:41
  • Ashwinee: There seems to be no difference whether we've been actively using the system a lot or just leaving it idle for a couple of weeks. It seems like the memory grows faster at some times than others, but we haven't been able to detect a pattern. The only thing we see when stressing the system is more activity on the heap, which is natural. – Serenic Aug 24 '16 at 08:46
  • "How do you profile memory used by native code though?" Maybe try making a small native binary which uses every external library and puts them under stress (i.e. making extensive use of their functions). Then use, e.g., Valgrind to find memory leaks. (Don't know if running JVM under Valgrind is a good idea or even works as desired.) – Martin Nyolt Sep 07 '16 at 08:04
  • See http://stackoverflow.com/questions/26041117/growing-resident-memory-usage-rss-of-java-process/35610063 – Lari Hotari Sep 07 '16 at 09:33

2 Answers


I'll try to partially answer your question.

The basic strategy I try to stick to in such situations is to monitor the max/used/peak values of every available memory pool, plus open files, sockets, buffer pools, number of threads, etc. There might be a leak of socket connections/open files/threads which is easy to miss.
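As an illustration, a sketch of what such a monitoring loop might poll (class name and output format are illustrative; the UnixOperatingSystemMXBean cast assumes a Unix-like JVM):

import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.*;

public class ResourceMonitor {
    public static void main(String[] args) {
        // Per-pool memory usage (heap and non heap pools such as Metaspace, Code Cache, ...)
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("pool %-30s used=%d committed=%d max=%d%n",
                    pool.getName(), u.getUsed(), u.getCommitted(), u.getMax());
        }
        // Direct and mapped NIO buffers live outside the heap
        for (BufferPoolMXBean buf : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %-8s count=%d memoryUsed=%d%n",
                    buf.getName(), buf.getCount(), buf.getMemoryUsed());
        }
        // Each live thread costs a native stack
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("threads=" + threads.getThreadCount());
        // Open file descriptors (includes sockets) on Unix-like systems
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            System.out.println("open fds="
                    + ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount());
        }
    }
}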

In your case it looks like you really do have a native memory leak, which is quite nasty and hard to find.

You may try to profile the memory. Take a look at the GC roots and find out which of them are JNI global references. That may help you find out which classes are not being collected. For example, this is a common problem in AWT, which may require explicit component disposal.
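A minimal, purely illustrative sketch of explicit disposal in Swing/AWT; a window that is only hidden keeps its native peer, and everything reachable from it, alive:

import javax.swing.JDialog;
import javax.swing.SwingUtilities;
import javax.swing.WindowConstants;

public class DialogDisposalExample {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JDialog dialog = new JDialog();
            // DISPOSE_ON_CLOSE frees the native window peer when the dialog is closed;
            // the default (HIDE_ON_CLOSE) only hides it, keeping the peer reachable.
            dialog.setDefaultCloseOperation(WindowConstants.DISPOSE_ON_CLOSE);
            dialog.setSize(300, 200);
            dialog.setVisible(true);
            // For windows that are closed programmatically, call dispose() explicitly:
            // dialog.dispose();
        });
    }
}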

To inspect the JVM's internal memory usage (which does not belong to the heap/off-heap memory pools), -XX:NativeMemoryTracking is very handy. It allows you to track thread stack sizes, GC/compiler overheads and much more. The greatest thing about it is that you can create a baseline at any point in time and then track memory diffs since the baseline was made:

# jcmd <pid> VM.native_memory baseline
# jcmd <pid> VM.native_memory summary.diff scale=MB

Total:  reserved=664624KB  -20610KB, committed=254344KB -20610KB
...
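Note that native memory tracking has to be enabled when the JVM is started (it is off by default and adds a small overhead), otherwise the jcmd commands above will report that it is not enabled:

# java -XX:NativeMemoryTracking=summary <other JVM options> <main class>

Use =detail instead of =summary if you want a more fine-grained, per-call-site breakdown.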

You can also use the JMX com.sun.management:type=DiagnosticCommand/vmNativeMemory operation to generate these reports.
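A sketch of how that could look from code, assuming the standard DiagnosticCommand MBean whose operations take a String[] of arguments and return the report as a String:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class NativeMemoryReport {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName diagnostics = new ObjectName("com.sun.management:type=DiagnosticCommand");
        // Roughly equivalent to "jcmd <pid> VM.native_memory summary" on the local JVM
        String report = (String) server.invoke(
                diagnostics,
                "vmNativeMemory",
                new Object[] { new String[] { "summary" } },
                new String[] { String[].class.getName() });
        System.out.println(report);
    }
}

The same invocation also works over a remote JMX connection (MBeanServerConnection), which is handy if the target box has no JDK tools installed.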

And... you can go deeper and inspect the pmap -x <pid> output and/or the procfs content.
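For example (illustrative commands; the RSS column position may differ between pmap versions), sorting the mappings by resident size and watching which ones keep growing over time can point at the leaking region:

# pmap -x <pid> | sort -n -k3 | tail
# grep VmRSS /proc/<pid>/status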

vsminkov

We finally seem to have identified the root cause of the problem we had. This is an answer explaining what specifically caused that problem, since it may well be of use to others.

TLDR:

The problem was caused by a bug in the JDK which is now fixed and will be released with JDK 8u152. Link to the bug report

The whole story:

We continued to monitor our application's memory performance after I first posted this question, and the -XX:NativeMemoryTracking suggested by vsminkov helped greatly with narrowing down and pinpointing the area of memory that was leaking.

What we found was that the "Thread - Arenas" area was growing indefinitely. As this kind of leak was something we were pretty sure we hadn't experienced earlier, we started testing with earlier versions of Java to see if it had been introduced at some specific point.

After going back down to Java 8u73 the leak wasn't there, and although being forced to use an older JDK version wasn't optimal, at least we had a way to work around the problem for now.

Some weeks later, while running on update 73, we noticed that the application was still leaking, and once again we started searching for a culprit. We found that the problem was now located in the "Class - malloc" area.

At this point we were almost certain the leak was not our application's fault, and we were considering contacting Oracle to report the issue as a potential bug, but then a colleague of mine stumbled across this bug report on the JDK HotSpot compiler: Link to the bug report

The bug description is very similar to what we saw. According to what is written in the report, the memory leak has been present since the Java 8 release, and after testing with an early release of JDK 8u152 we are now fairly certain the leak is fixed. After 5 days of running, our application's memory footprint now seems close to 100% stable. The Class - malloc area still grows slightly, but it's now going up at a rate of about 100 KB a day (compared to several MB earlier), and having tested for only 5 days I can't rule out that it will eventually stabilize completely.

I can't say for certain, but it seems likely the issues we had with the Class malloc and Thread arenas growing were related. In any case, both problems are gone in update 152. Unfortunately, the update doesn't seem to be scheduled for official release until late 2017, but our testing with the early release seems promising so far.

Serenic