I have been given the interesting task of tracking down a stop-the-world garbage collection / memory leak in a third-party "black box" RESTful application that is running in production.
The application is load balanced, and recently a stop-the-world garbage collection occurred on all server instances, which led to a production service outage.
We don't have access to the third-party source code.
This is what I have done so far:
- I have checked that the JVM command line parameters are correct. The container is Jetty, running on OpenJDK 8 with the CMS garbage collector.
- I have successfully been using VisualVM, with the Memory Pools and Visual GC plugins, to profile the app (-verbose:gc is enabled).
- My intention is to look at the amount of traffic we get in production (for each API endpoint) and run a soak test. I will keep increasing the test load, with the intention of triggering the stop-the-world GC.
- There is no specific out-of-memory exception, "just" a stop-the-world pause with the application threads suspended. After 5-10 minutes the application starts accepting requests again and the 502 responses from the load balancer stop.
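To complement VisualVM during the soak test, I am considering a small JMX poller along these lines. It is only a sketch under assumptions: the class name `GcWatch`, the `localhost` host name, the one-minute sampling interval and the "Old Gen" pool-name match are my own choices (with CMS I would expect the pool to be called "CMS Old Gen"). The idea is that if old generation occupancy measured right after each collection keeps rising, that points to a leak rather than just a load spike.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class GcWatch {
    /** Least-squares slope of the samples, in units per sample interval. */
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) return 0.0;
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long s : samples) meanY += s;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samples[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    /** Bytes used in the old generation right after its most recent collection. */
    static long oldGenUsedAfterLastGc(MBeanServerConnection conn) throws Exception {
        long total = 0;
        for (MemoryPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(conn, MemoryPoolMXBean.class)) {
            // Assumption: the old generation pool name ends with "Old Gen".
            if (!pool.getName().endsWith("Old Gen")) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (afterGc != null) total += afterGc.getUsed();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Port 9010 matches -Dcom.sun.management.jmxremote.port; the host is an assumption.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            long[] samples = new long[60]; // one hour at one sample per minute
            for (int i = 0; i < samples.length; i++) {
                samples[i] = oldGenUsedAfterLastGc(conn);
                for (GarbageCollectorMXBean gc
                        : ManagementFactory.getPlatformMXBeans(conn, GarbageCollectorMXBean.class)) {
                    // Cumulative count and time per collector since JVM start.
                    System.out.printf("%s count=%d timeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(60_000);
            }
            System.out.printf("old gen after-GC trend: %.0f bytes per sample%n",
                slope(samples));
        }
    }
}
```

A clearly positive slope that survives several old generation collections would suggest retained objects. A flat floor with occasional long pauses would point more towards GC tuning or fragmentation.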
I have already looked at the question "How to find a Java Memory Leak".
I am at a disadvantage not being able to look at the source code.
Can someone please give me any further tips or strategies for tracking down what is causing the stop-the-world GC and the memory leak?
Here are the JVM parameters which are being used:
java -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=true
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Xms6g -Xmx6g -XX:MetaspaceSize=2g -XX:MaxMetaspaceSize=2g
-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-Dsun.net.client.defaultConnectTimeout=10000
-Dsun.net.client.defaultReadTimeout=30000
-XX:+DisableExplicitGC -d64 -verbose:gc -Xloggc:/var/log/gc.log
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/heapdump.hprof
-XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled
-XX:+ParallelRefProcEnabled
-XX:+UseLargePagesInMetaspace -XX:MaxGCPauseMillis=100
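Since these flags write a detailed log to /var/log/gc.log, one thing I could do is scan it for the events that, as far as I understand, usually mean CMS fell back to a full stop-the-world collection. The sketch below is my own and rests on assumptions: it expects the usual JDK 8 -XX:+PrintGCDetails text (a "real=... secs" field, plus the phrases "Full GC", "concurrent mode failure" and "promotion failed"), and the class name `GcLogScan` and the 1-second threshold are arbitrary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class GcLogScan {
    // Wall-clock time as printed in the "[Times: ...]" section of a GC log line.
    private static final Pattern REAL = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Returns the wall-clock seconds reported on the line, or -1 if there is none. */
    static double realSeconds(String line) {
        Matcher m = REAL.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    /** True if the line names a full-collection event or reports a long pause. */
    static boolean isSuspicious(String line, double thresholdSecs) {
        return line.contains("Full GC")
            || line.contains("concurrent mode failure")
            || line.contains("promotion failed")
            || realSeconds(line) >= thresholdSecs;
    }

    public static void main(String[] args) throws IOException {
        // Default path matches -Xloggc above; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "/var/log/gc.log";
        double thresholdSecs = 1.0; // arbitrary cut-off for a "long" pause
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.filter(l -> isSuspicious(l, thresholdSecs))
                 .forEach(System.out::println);
        }
    }
}
```

Lining up the timestamps of the flagged lines with the outage window should at least show whether the pause was a single very long collection or a run of back-to-back full collections.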
Thanks
Miles.