
Problem description

After running my Java server application for a while, I am experiencing strange behaviour of the Oracle Java virtual machine on Solaris. Normally, when the JVM crashes, an hs_err_pid.log file is created (the location is determined by the -XX:ErrorFile JVM parameter, as explained here: How can I suppress the creation of the hs_err_pid file?).

But in my case the file was not created; the only thing left was the core dump file.

Using the standard Solaris tools pstack and pflags, I was able to gather more information about the crash from the core file (included below).

Tried solutions

  • Tried to find all hs_err_pid.log files across the file system, but nothing could be found (even outside the application working directory), e.g.:

    find / -name "hs_err_pid*"

  • I tried to find known JVM bugs matching this case, but I couldn't find anything similar.

  • The problem looks somewhat similar to: Java VM: reproducable SIGSEGV on both 1.6.0_17 and 1.6.0_18, how to report?, but I still cannot confirm this, since the hs_err_pid.log file is missing and of course the OS platform is different.
  • (EDIT) As suggested in one of the answers to the Tool for analyzing java core dump question, I have extracted a heap dump from the core file using jmap and analysed it with Eclipse MAT. I found a leak (elements added to a HashMap and never removed; 1.4 M elements at the time of the core dump). This, however, explains neither why the hs_err_pid.log file was not generated, nor why the JVM crashed.
  • (EDIT2) As suggested by Darryl Miles, the -Xmx limitations have been checked (the test contained code that indefinitely added objects to a LinkedList):
    • java -Xmx1444m Test results in java.lang.OutOfMemoryError: Java heap space,
    • java -Xmx2048m Test results in java.lang.OutOfMemoryError: Java heap space,
    • java -Xmx3600m Test results in a core dump.
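For reference, the test harness was roughly the following (a reconstruction, not the original source; the class name matches the invocations above, everything else is an assumption):

```java
import java.util.LinkedList;
import java.util.List;

// Sketch of the Test class used in EDIT2: it keeps appending
// objects to a LinkedList until the heap is exhausted.
public class Test {
    // Bounded helper; pass Integer.MAX_VALUE to mimic the unbounded test.
    static int fill(int limit) {
        List<Object> sink = new LinkedList<Object>();
        for (int i = 0; i < limit; i++) {
            // With no practical bound, this ends in
            // java.lang.OutOfMemoryError: Java heap space under a sane -Xmx,
            // or in a core dump with -Xmx3600m on this 32-bit Solaris JVM.
            sink.add(new Object());
        }
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(Integer.MAX_VALUE));
    }
}
```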

The question

Has anyone experienced a similar problem with the JVM, and how should one proceed in such cases to find out what actually happened (i.e. in what cases does the JVM dump core without creating an hs_err_pid.log file)?

Any tip or pointer to resolving this would be very helpful.

Extracted flags

# pflags core
...
/2139095:      flags = DETACH
    sigmask = 0xfffffeff,0x0000ffff  cursig = SIGSEGV

Extracted stack

# pstack core
...
-----------------  lwp# 2139095 / thread# 2139095  --------------------
 fb208c3e ???????? (f25daee0, f25daec8, 74233960, 776e3caa, 74233998, 776e64f0)
 fb20308d ???????? (0, 1, f25db030, f25daee0, f25daec8, 7423399c)
 fb20308d ???????? (0, 0, 50, f25da798, f25daec8, f25daec8)
 fb20308d ???????? (0, 0, 50, f25da798, 8561cbb8, f25da988)
 fb203403 ???????? (f25da988, 74233a48, 787edef5, 74233a74, 787ee8a0, 0)
 fb20308d ???????? (0, f25da988, 74233a78, 76e2facf, 74233aa0, 76e78f70)
 fb203569 ???????? (f25da9b0, 8b5b400, 8975278, 1f80, fecd6000, 1)
 fb200347 ???????? (74233af0, 74233d48, a, 76e2fae0, fb208f60, 74233c58)
 fe6f4b0b __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (74233d44, 74233bc8, 74233c54, 8b5b400) + 1a3
 fe6f4db3 __1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_ (fe6f4968, 74233d44, 74233bc8, 74233c54, 8b5b4
00) + 27
 fe6f4deb __1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (74233d44, 8975278, 74233c54, 8b5b400) + 2f
 fe76826d __1cJJavaCallsMcall_virtual6FpnJJavaValue_nLKlassHandle_nMsymbolHandle_4pnRJavaCallArguments_pnGThread__v_ (74233d44, 897526c, fed2d464, fed2d6d0, 7
4233c54, 8b5b400) + c1
 fe76f4fa __1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_ (74233d44, 8975268, 897526c, fed2d464, fed2d6d0, 8b5b
400) + 7e
 fe7805f6 __1cMthread_entry6FpnKJavaThread_pnGThread__v_ (8b5b400, 8b5b400) + d2
 fe77cbe4 __1cKJavaThreadRthread_main_inner6M_v_ (8b5b400) + 4c
 fe77cb8e __1cKJavaThreadDrun6M_v_ (8b5b400) + 182
 feadbd59 java_start (8b5b400) + f9
 feed59a9 _thr_setup (745c5200) + 4e
 feed5c90 _lwp_start (745c5200, 0, 0, 74233ff8, feed5c90, 745c5200)

System information:

# uname -a
SunOS xxxx 5.10 Generic_137138-09 i86pc i386 i86pc
# java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Server VM (build 11.0-b16, mixed mode)
# ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 10240
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited

Used JVM args:

java -Xms1024M -Xmx2048M -verbose:gc -Xloggc:logs/gc.log -server com.example.MyApplication
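While investigating, it may also be worth pinning the error file to an explicit, known-writable location (the path below is only an example; -XX:ErrorFile is the same flag referenced at the top of the question, and %p expands to the process id):

```shell
java -Xms1024M -Xmx2048M \
     -XX:ErrorFile=/var/tmp/hs_err_pid%p.log \
     -verbose:gc -Xloggc:logs/gc.log -server com.example.MyApplication
```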

Please comment if you find some information missing; I'll try to add it.

pkk
  • Is the startup directory and/or current working directory writable by the JVM? You think the cause of the crash is a memory leak with too many objects; it is unusual to see any core file for this, as a graceful OutOfMemoryError should occur, unless there is some JNI error. It is possible the crash handler (the code that writes out hs_err_pid*.log) crashed. Maybe run a system-call trace on the running process to observe what it is doing at the end of its life (i.e. you should be able to see whether the crash handler crashed, and whether it tried to create any hs_err_pid*.log file). On Linux use "strace", on Solaris "truss". – Darryl Miles Sep 29 '11 at 03:27
  • Thank you for your answer. I am sure that the working directory is writable (for two reasons: the JVM runs with root privileges, and it has successfully written there earlier). In general the application uses no JNI code itself, nor do its jar dependencies. But the question remains the same: why a core dump and no hs_err_pid*.log file? Is it possible that the Solaris kernel sent one of the "core-dumping" signals, i.e.: http://developers.sun.com/solaris/articles/signalprimer.html ? – pkk Sep 29 '11 at 06:40
  • If you gradually reduce -Xmx to 1700M, can you get the process to throw OutOfMemoryError? What is the maximum usable -Xmx value on Solaris? For example, it is around 1800M on 32-bit Windows (since by default the kernel has the upper 2Gb, and the DLLs/shared data/stack eat around 200M). Did you manage to run "truss" on the process to obtain useful output? What are your "ulimit -a" settings (say, from a bash shell)? – Darryl Miles Oct 04 '11 at 22:22
  • So what you are implying is that a low stack limit, combined with excessive VM usage, may cause this? – pkk Oct 05 '11 at 08:11
  • That isn't such a low stack limit, unless you are doing massive recursion and need a lot of stack space. "ulimit -Ha" might indicate the hard limit is unlimited, and I'd expect a JVM to manage that accordingly as required. JVM stacks and C-language stacks are different. How many application threads does the program have? I would hope 'truss' can indicate what signals occur and also if/when it runs out of heap (the kernel refusing to provide more heap via brk/sbrk syscalls; under Linux all this is auditable with 'strace'). I am suggesting that by reducing -Xmx you might be able to get a controlled OOM. – Darryl Miles Oct 05 '11 at 08:28
  • I hope this is the right answer to your question: the application uses 114 LWPs, as shown by `prstat -m -L -p APPLICATION_PID`. I wouldn't want to truss a running production server, so this information is not yet available to me. – pkk Oct 05 '11 at 10:09
  • @Darryl Miles, please post your -Xmx limitation comments as an answer, and you will receive the bounty, as they helped find the maximum usable memory on the production machine. Details follow in the question edit. – pkk Oct 07 '11 at 07:16

2 Answers


1.6.0_11 is quite old and I have no recent experience with it; I really recommend upgrading there...

However, no crash dump may be produced when a stack overflow occurs in native code, i.e. when some native function (like the write of FileOutputStream; sockets use the same implementation) is called with very little stack remaining. So even though the JVM attempts to write the file, there is not enough stack and the writing code also crashes. The second stack overflow just bails out the process.
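The distinction matters because an overflow that unwinds into Java frames is reported gracefully, while one inside native frames may not be. A trivial sketch of the benign case (purely illustrative, with hypothetical names):

```java
// Deep recursion in pure Java code: HotSpot detects the overflow and
// raises a catchable StackOverflowError. When the overflow happens in
// native code instead (e.g. inside FileOutputStream.write), there may be
// no stack left even for the error handler, and the process core-dumps.
public class StackDemo {
    static int depth(int n) {
        return depth(n + 1) + 1;  // never returns normally
    }

    static boolean overflowIsCatchable() {
        try {
            depth(0);
            return false;  // unreachable in practice
        } catch (StackOverflowError e) {
            return true;   // the JVM reported the overflow gracefully
        }
    }

    public static void main(String[] args) {
        System.out.println(overflowIsCatchable());
    }
}
```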

I did have a similar case (no file created) on a production system, and it was not pretty to trace, but the above explains the reason.

bestsss
  • It is my understanding there is no native code other than that provided by the JRE/JDK (as per pkk's second comment). However, the heap limits look to be exhausted, hence the suggestion to reduce -Xmx to make the JVM fail via OOM before the kernel starts to deny new memory allocation requests because there is no more heap. This gives the JVM breathing space to recover gracefully, since it still has memory headroom to work in. – Darryl Miles Oct 05 '11 at 08:35
  • You want the JVM to police the memory limit while the JVM itself can still ask for, and get, more memory from the kernel for its own internal needs. Setting -Xmx too high causes the kernel to enforce the limits, which means the JVM's internal operations lose out, as they too have run out of "railroad". So the question is: what is a reasonable value of -Xmx for a 32-bit Solaris system? I know for 32-bit Windows it's around 1800M, so you want to be lower than that, but it can depend, as DLLs/shared data/thread stacks also compete for the same process address space. Maybe pkk can use a 64-bit JVM and the problem will go away? – Darryl Miles Oct 05 '11 at 08:40
  • @Darryl, I am talking about the stack, not the heap (which you can set via -Xmx), and the JVM has native code, quite a bit of it. As I said, FileOutputStream and FileInputStream perform the real write/read in native code. 32-bit systems are limited by the largest contiguous block of virtual memory they can allocate. – bestsss Oct 05 '11 at 08:48
  • In my view, if you have any application that needs more than 1.5Gb of memory as a working set in a single process (on any OS), then you should seriously consider 64-bit. It is hard to justify tying yourself up in technical workarounds to shoe-horn the program into a 32-bit paradigm. Last time I looked, running a 64-bit JVM on Solaris just required adding "-d64" to the java command line: "java -d64 -server -version"? – Darryl Miles Oct 05 '11 at 08:49
  • @Darryl, I am usually not using 32bit java, definitely not for production. – bestsss Oct 05 '11 at 08:54
  • I don't really think that a native stack overflow occurred in this case, but switching to -d64 nevertheless made the core dump go away. It looks like a bug in the 32-bit Solaris JVM implementation. I shall accept the answer. – pkk Oct 07 '11 at 07:13
  • @pkk, nice, very nice fix for the problem. By the way, did you see any messages in the kernel log, like the process being killed by the oom_killer or anything like that? – bestsss Oct 07 '11 at 07:57
  • oom_killer is usually invoked when the entire VM system of the machine is exhausted. It is possible to have 1Gb RAM and 5Gb swap, run this one 32-bit application on the system, and never activate the oom_killer. But if you ran 3 or 4 of these same apps, all using at least 2Gb, then the oom_killer would do its thing to rescue the OS/host. – Darryl Miles Oct 07 '11 at 20:29
  • @Darryl, I run production systems w/ swapoff -a, always, if java hits the swap, it's doomed. – bestsss Oct 07 '11 at 21:01

As per my comments above, I believe this issue to be caused by running out of usable heap in the 32-bit address space, by having set too high a -Xmx value. This forced the kernel to police the limit (by denying requests for new memory) before the JVM could police it (via the controlled OutOfMemoryError mechanism). Unfortunately I do not know the specifics of Intel Solaris well enough to say what is to be expected from that platform.

But as a general rule for Windows, a maximum -Xmx might be 1800M; then reduce it by 16M per additional application thread you create, since each thread needs stack space (both native and Java stacks) as well as other per-thread accounting such as thread-local storage. The result of this calculation should give you an approximation of the realistic usable heap space of a Java VM in any 32-bit process whose operating system uses a 2G/2G (user/kernel) split.
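Applied to this question's numbers, the calculation is a back-of-the-envelope sketch only (the 1800M base and 16M-per-thread figures are the rough Windows values quoted above, not measured Solaris ones), but the 114 LWPs reported in the comments would already consume the whole budget:

```java
// Rule-of-thumb estimate of usable -Xmx in a 32-bit JVM:
// start from a platform base (~1800M, the 32-bit Windows figure above)
// and subtract ~16M of stack/TLS overhead per application thread.
public class HeapBudget {
    static long usableHeapMb(long baseMb, long perThreadMb, int threads) {
        return baseMb - perThreadMb * threads;
    }

    public static void main(String[] args) {
        // 114 LWPs, as reported by prstat in the comments:
        // 1800 - 16 * 114 = -24, i.e. the address space is already
        // overcommitted before any -Xmx heap is accounted for.
        System.out.println(usableHeapMb(1800, 16, 114));
    }
}
```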

It is possible with WinXP and above to use the /3G switch on the kernel to get a higher split (3G/1G user/kernel), and Linux has a /proc/<pid>/maps file that lets you see exactly how the address space of a given process is laid out. (If you were running this application there, you could watch over time as the [heap] grows towards the shared file mappings used for .text/.rodata/.data/etc. from DSOs; this results in the kernel denying requests to grow the heap.)

This problem goes away with 64-bit, because there is so much more address space to use that you will run out of physical and virtual (swap) memory before the heap meets the other mappings.

I believe 'truss' on Solaris would have shown a brk/sbrk system call returning an error code shortly before the core dump. Parts of the standard native libraries are coded to never check the return code from requests for new memory, and as a result crashes can be expected.

Darryl Miles