We have a Java application that is crashing our Red Hat server (30 cores / 512 GB RAM) by consuming some resource (we don't know which) that prevents other components from creating new threads. We currently work around this by killing the process that is spawning the threads each time the problem appears, which is about every 15 days. We tried setting huge values in /etc/security/limits.conf, but the problem shows up well before that limit is reached.
I counted the threads the last time it happened using ps -efL | wc -l. Is 10000 threads a lot for a machine this size, given that CPU and RAM consumption were low at that moment? I used gstack to try to figure out where it is stuck, but since it is a Java program I don't know whether the output is meaningful. I could still identify a pattern: most of the 9000 threads look like this:
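As far as I understand, limits.conf is only applied when a session is opened, so the values that matter are the ones the running JVM actually inherited. A minimal sketch of how they could be read, assuming the usual Linux /proc/<pid>/limits layout (a header row, then columns separated by runs of spaces); the function names are hypothetical:

```python
import re


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def effective_limits(pid):
    """Read the limits the kernel is enforcing for a running process."""
    with open("/proc/{}/limits".format(pid)) as fh:
        return parse_limits(fh.read())


def read_kernel_int(path):
    """Read a single integer such as /proc/sys/kernel/threads-max."""
    with open(path) as fh:
        return int(fh.read().strip())
```

For example, effective_limits(pid)["Max processes"] could be compared with read_kernel_int("/proc/sys/kernel/threads-max") and read_kernel_int("/proc/sys/kernel/pid_max") to see which ceiling is the lowest.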
Thread 9049 (Thread 0x7f43d5087700 (LWP 123925)):
#0 0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2 0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5 0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6 0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f43d722f1ad in clone () from /lib64/libc.so.6
Thread 9048 (Thread 0x7f43d4f86700 (LWP 123926)):
#0 0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2 0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5 0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6 0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f43d722f1ad in clone () from /lib64/libc.so.6
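To put a number on that pattern, the gstack output could be grouped by stack signature. A minimal sketch, assuming the output was saved to a text file in the format shown above; the function name is hypothetical:

```python
import re
from collections import Counter

# Matches headers like "Thread 9049 (Thread 0x7f43d5087700 (LWP 123925)):"
THREAD_RE = re.compile(r"^Thread \d+ \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")
# Captures the function name from lines like "#4 0x... in GangWorker::loop() () from ..."
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?(.+?) \(")


def count_stack_signatures(gstack_text):
    """Count threads per distinct sequence of frame function names."""
    counts = Counter()
    frames = None  # frames of the thread currently being read
    for line in gstack_text.splitlines():
        line = line.strip()
        if THREAD_RE.match(line):
            if frames is not None:
                counts[tuple(frames)] += 1
            frames = []
        elif frames is not None:
            match = FRAME_RE.match(line)
            if match:
                frames.append(match.group(1))
    if frames is not None:
        counts[tuple(frames)] += 1  # flush the last thread
    return counts
```

Calling count_stack_signatures(open("gstack.txt").read()).most_common(5) (gstack.txt being a hypothetical file name) would list the five most frequent stacks with their thread counts.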
Also, before killing the process I used gcore -o /tmp/dump.txt. Is this a correct way to get a core file of a Java process?
When I try to inspect it with gdb, I get "no debugging symbols" and "not a core dump". Is this the right way to examine this kind of file?
M1:~# gdb /opt/3pp/jre/bin/java /tmp/dump.txt.123913
GNU gdb (GDB)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/3pp/jre/bin/java...(no debugging symbols
"/tmp/dump.txt.123913" is not a core dump: File format not recognized
Missing separate debuginfos, use: debuginfo-install jre1.8.0_25-1.8.0_25-fcs.x86_64
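The "File format not recognized" message makes me suspect the file may not be a valid ELF core at all, though I don't know why that would be. A minimal sketch of a sanity check on the file header, assuming the standard ELF layout (16-byte e_ident followed by a 2-byte e_type, where 4 means core file); the function names are hypothetical:

```python
import struct

ET_CORE = 4  # e_type value for core files in the ELF specification


def elf_type(path):
    """Return the ELF e_type of a file, or None if it is not an ELF file."""
    with open(path, "rb") as fh:
        header = fh.read(18)  # 16-byte e_ident followed by 2-byte e_type
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return None
    # e_ident[5] is EI_DATA: 1 = little-endian, 2 = big-endian
    fmt = "<H" if header[5:6] == b"\x01" else ">H"
    return struct.unpack(fmt, header[16:18])[0]


def is_core_file(path):
    """True only if the file starts with an ELF header of type ET_CORE."""
    return elf_type(path) == ET_CORE
```

If is_core_file("/tmp/dump.txt.123913") returned False, the dump itself would be the problem rather than the way gdb is invoked.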
Thanks in advance for your time.