
I have been given the interesting task of tracking down a stop-the-world garbage collection / memory leak in a third-party "black box" RESTful application that is in production.

The application is load balanced, and recently, the application had a stop-the-world garbage collection on all server instances, which led to a production service outage.

I (we) don't have access to the third-party code.

This is what I have done so far:

  1. I have checked that the JVM command line parameters are correct. The container is Jetty on OpenJDK 8, with the CMS garbage collector.
  2. I have been profiling the app successfully with VisualVM and its Memory Pools and Visual GC plugins (-verbosegc is enabled).
  3. I intend to look at the amount of traffic each API endpoint gets in production and run a soak test, increasing the test load until the stop-the-world GC happens.
  4. There is no specific out-of-memory exception, "just" a stop-the-world pause with the application threads suspended. After 5-10 minutes the application starts accepting requests again (and the 502s on the load balancer go away).
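
To have numbers to compare between soak test runs, something like the following could sample the GC time counters over the JMX port that is already open. This is only a minimal sketch: the class name `GcSampler`, the 5 second interval and the `localhost` default are my own assumptions, and because `jmxremote.local.only=true` is set it would have to run on the same host as the application. A sudden jump in the per-interval delta would be a hint that a long collection has just finished.

```java
import java.io.IOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class GcSampler {

    /** GC beans of the JVM running this code (handy for a dry run). */
    static List<GarbageCollectorMXBean> localBeans() {
        return ManagementFactory.getGarbageCollectorMXBeans();
    }

    /** GC beans of a remote JVM exposed through the standard JMX RMI connector. */
    static List<GarbageCollectorMXBean> remoteBeans(String host, int port) throws IOException {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        // The connector is left open on purpose: the returned proxies need it.
        JMXConnector connector = JMXConnectorFactory.connect(url);
        return ManagementFactory.getPlatformMXBeans(
                connector.getMBeanServerConnection(), GarbageCollectorMXBean.class);
    }

    /** Cumulative collection time in milliseconds, keyed by collector name. */
    static Map<String, Long> snapshotMillis(List<GarbageCollectorMXBean> beans) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : beans) {
            // getCollectionTime() returns -1 if the collector does not report it.
            totals.put(gc.getName(), Math.max(0L, gc.getCollectionTime()));
        }
        return totals;
    }

    /** GC time accumulated between two snapshots, per collector. */
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> diff = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            diff.put(e.getKey(), e.getValue() - before.getOrDefault(e.getKey(), 0L));
        }
        return diff;
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults: same host, port taken from the jmxremote.port setting.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9010;
        List<GarbageCollectorMXBean> beans = remoteBeans(host, port);
        Map<String, Long> previous = snapshotMillis(beans);
        while (true) {
            Thread.sleep(5000); // sampling interval, an arbitrary choice
            Map<String, Long> current = snapshotMillis(beans);
            System.out.println(System.currentTimeMillis() + " " + delta(previous, current));
            previous = current;
        }
    }
}
```

While the JVM is paused the remote call itself will probably just hang, so the interesting sample is the first one that comes back afterwards.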

I have already looked at "How to find a Java Memory Leak".

I am at a disadvantage not being able to look at the source code.

Can someone please give me any further tips or strategies for tracking down what is causing the stop-the-world GC and the memory leak?


Here are the JVM parameters which are being used:

java -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=true               
-Dcom.sun.management.jmxremote.authenticate=false  
-Dcom.sun.management.jmxremote.ssl=false
-Xms6g -Xmx6g -XX:MetaspaceSize=2g -XX:MaxMetaspaceSize=2g
-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-Dsun.net.client.defaultConnectTimeout=10000 
-Dsun.net.client.defaultReadTimeout=30000
-XX:+DisableExplicitGC -d64 -verbose:gc -Xloggc:/var/log/gc.log 
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/heapdump.hprof 
-XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled 
-XX:+ParallelRefProcEnabled 
-XX:+UseLargePagesInMetaspace -XX:MaxGCPauseMillis=100
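
Since these flags write a detailed log to /var/log/gc.log, the events around the outage could be pulled out with a small scanner along these lines. It is a rough, line-based sketch rather than a proper GC log parser: the class name `GcLogScanner` and the 1 second threshold are assumptions of mine, and with CMS a single event can be spread over several log lines, so a hit is only a pointer to where to read more closely. It flags lines that mention "Full GC", "concurrent mode failure" or "promotion failed", or whose `real=... secs` wall-clock time is at or above the threshold.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcLogScanner {

    // Wall-clock time from the "[Times: user=... sys=..., real=... secs]" suffix.
    // A comma is accepted as decimal separator in case the log was written that way.
    private static final Pattern REAL =
            Pattern.compile("real=(\\d+(?:[.,]\\d+)?) secs");

    /** Last "real" time found on the line in seconds, or -1 if there is none. */
    static double realSeconds(String line) {
        Matcher m = REAL.matcher(line);
        double last = -1.0;
        while (m.find()) {
            last = Double.parseDouble(m.group(1).replace(',', '.'));
        }
        return last;
    }

    /** True for lines that name a fallback to a full collection or report a long pause. */
    static boolean isSuspicious(String line, double thresholdSecs) {
        if (line.contains("Full GC")
                || line.contains("concurrent mode failure")
                || line.contains("promotion failed")) {
            return true;
        }
        return realSeconds(line) >= thresholdSecs;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the -Xloggc setting; threshold is an arbitrary choice.
        String path = args.length > 0 ? args[0] : "/var/log/gc.log";
        double thresholdSecs = args.length > 1 ? Double.parseDouble(args[1]) : 1.0;
        int lineNo = 0;
        for (String line : Files.readAllLines(Paths.get(path))) {
            lineNo++;
            if (isSuspicious(line, thresholdSecs)) {
                System.out.println(lineNo + ": " + line);
            }
        }
    }
}
```

Running it as `java GcLogScanner /var/log/gc.log 1.0` would print each flagged line with its line number, which should make it easier to post the relevant events with some surrounding context.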

Thanks

Miles.

chocksaway
  • "Stop-the-world" may not be caused by a memory leak! Looking into the source code would not help you identify the memory leak problem. What you need is a profiler tool. I have been using JProfiler or YourKit, but if you don't have these tools, VisualVM or JConsole can also help you view the "Full GC" when it happens. Have you explored other GC settings, e.g. ParallelOldGC... – Minh Kieu Jun 04 '17 at 21:26
  • *"-verbosegc is enabled"* - if you have GC logs you should post the events in question + some context. – the8472 Jun 05 '17 at 08:37
  • I have added the JVM parameters to the original question. – chocksaway Jun 05 '17 at 10:14
