We have a Java application that is crashing our Red Hat server (30 cores / 512 GB RAM) by consuming some (unknown?) resource, preventing other components from creating new threads. We are currently working around this by killing the process that is spamming the threads each time the problem appears, which is about every 15 days. We tried setting huge values in /etc/security/limits.conf, but we hit the problem well before reaching that limit.
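
For reference, these are the system-wide ceilings that could also be in play besides limits.conf (a sketch; the paths are standard on RHEL, and the PID is just an example):

cat /proc/sys/kernel/threads-max   # max threads across the whole system
cat /proc/sys/kernel/pid_max       # max PIDs/TIDs (each thread consumes one)
cat /proc/sys/vm/max_map_count     # per-process mappings; each thread stack adds some
cat /proc/123913/limits            # effective limits of the running JVM (example PID)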

I counted the threads the last time it happened using ps -efL | wc -l. Is 10,000 threads a lot for our beast, given that CPU/RAM consumption was low at that moment? I used gstack to try to figure out where it is stuck, but since it is a Java program I don't know whether the output is meaningful. I could, however, identify a pattern: most of the ~9,000 threads look like this:

Thread 9049 (Thread 0x7f43d5087700 (LWP 123925)):
#0  0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2  0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3  0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4  0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5  0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6  0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f43d722f1ad in clone () from /lib64/libc.so.6
Thread 9048 (Thread 0x7f43d4f86700 (LWP 123926)):
#0  0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2  0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3  0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4  0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5  0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6  0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f43d722f1ad in clone () from /lib64/libc.so.6
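
For completeness, the same count can be taken for this process alone, rather than system-wide (PID as in the gdb session below):

ps -o nlwp= -p 123913    # NLWP = number of light-weight processes (threads)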

Also, before killing the process, I used gcore -o /tmp/dump.txt. Is that a correct way to get a core file of a Java process?
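
For reference, the resulting file was /tmp/dump.txt.123913, so the invocation was along these lines (gcore ships with gdb and appends the PID to the prefix given with -o):

gcore -o /tmp/dump.txt 123913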

When I attempt to take a look using gdb, I get "no debugging symbols" and "not a core dump". Is this the right way to inspect this kind of file?

M1:~# gdb /opt/3pp/jre/bin/java /tmp/dump.txt.123913 
GNU gdb (GDB) 
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/3pp/jre/bin/java...(no debugging symbols found)...done.
"/tmp/dump.txt.123913" is not a core dump: File format not recognized
Missing separate debuginfos, use: debuginfo-install jre1.8.0_25-1.8.0_25-fcs.x86_64

Thanks in advance for your time.

St.Antario

2 Answers

I counted the threads the last time it happened using ps -efL | wc -l. Is 10,000 threads a lot for our beast, given that CPU/RAM consumption was low at that moment?

It's not an insignificant number of threads, but no, 10K threads is not that much, especially for a 30-core machine. The 4-core Windows desktop I'm currently on has ~3K.

I used gstack to try to figure out where it is stuck, but since it is a Java program I don't know whether the output is meaningful.

I've never tried debugging Java using native thread stacks, but that stack trace looks to me like a "parked" thread; in other words, a thread in some thread pool that has nothing to do, so it is waiting for work. See this answer for more details.
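
As a quick sanity check, you could count how many native threads are sitting in pthread_cond_wait (a sketch, assuming the gstack output format shown in the question and the PID from its gdb session):

gstack 123913 | grep -c 'pthread_cond_wait'    # count of parked/waiting threads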

Also, before killing the process, I used gcore -o /tmp/dump.txt. Is that a correct way to get a core file of a Java process?

It probably has some value, but I would suggest using Java-specific tools for the job. The first that comes to mind is jcmd, which comes with the JDK. Here's a link to get you started; Java 9's version has nicer documentation and is very similar.

What I'd specifically do is use the Thread.print command of jcmd to print Java-level stack traces, and GC.heap_dump to dump the entire Java heap into an .hprof file which can later be analyzed by tools such as MAT.
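
For example (a sketch; PID 123913 is borrowed from the question's gdb session, and the output paths are arbitrary):

jcmd 123913 Thread.print > /tmp/threads.txt    # Java-level stacks, names, states
jcmd 123913 GC.heap_dump /tmp/heap.hprof       # full heap dump for MAT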

If you're using a JDK 8 with "Commercial Features", you could also enable JFR (Java Flight Recorder), which records the execution of the process. The files created by JFR can be opened either with Oracle's "Mission Control" or an alternative Mission Control, such as Azul's Zulu Mission Control.
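
A minimal sketch of starting a recording on the live process (this assumes the JVM was started with -XX:+UnlockCommercialFeatures -XX:+FlightRecorder; the recording name, duration, and path are arbitrary):

jcmd 123913 JFR.start name=diag duration=10m filename=/tmp/diag.jfr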

Finally, you could also try to connect to the process using jconsole, which is another tool that comes with the JDK.
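
For a local process it can attach directly by PID (again assuming PID 123913):

jconsole 123913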

Good luck.

Malt

I'll give you some general advice regarding JVM core files so you can choose whether to dig into them or not.

Is a Java core file generated using gcore useful?

It is useful, but if you are not familiar with the specifics of the JVM implementation it will look like a mess. The stack traces are completely fine, and the crash definitely did not happen because of the call to pthread_cond_wait (unless pthread itself is buggy, which is extremely unlikely).

We have a Java application that is crashing

Did you run a memory test? The HotSpot JVM implementation is highly reliable in the vast majority of cases.

Is that a correct way to get a core file of a Java process?

You can also use generate-core-file in gdb.

no debugging symbols

You already showed stack traces with debug symbols, so something is probably wrong with the core file itself. Try gdb's generate-core-file instead.
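
A sketch of the attach-and-dump sequence (PID taken from the question):

gdb -p 123913
(gdb) generate-core-file /tmp/jvm.core
(gdb) detach
(gdb) quit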

In case you want to dig into a HotSpot core dump, I would advise the following sequence of actions (a sketch of the session follows the list):

  1. info threads to find the "crash thread"
  2. Switch to it with thread N, where N is the "crash thread" number
  3. disas to disassemble the function and find the instruction that caused the crash
  4. If the crash happened because of dereferencing a garbage pointer, trace it back to figure out where the value came from
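
A hedged sketch of such a session (the thread number is hypothetical):

(gdb) info threads        # find the thread that received the signal
(gdb) thread 42           # switch to the crash thread (number from info threads)
(gdb) bt                  # native backtrace of that thread
(gdb) disas               # disassemble around the faulting instruction
(gdb) info registers      # inspect the operands the instruction was using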

Things get complicated if you crashed with something like a bus error because of incorrect mmap usage. They get even more complicated if you crashed in a JIT-compiled method, since bt, disas and friends will not be useful there. One possible way to go is to dump the compiled code from the nmethod at the given code_offset and try to figure out what went wrong.

St.Antario