3

I have an weird issue with a Java process which is stuck (once/twice per day) and it only recovers after I execute:

jstack -F ${PID}

While the Java process is stuck, If I try to do a thread dump with jcmd, I receive AttachNotSupportedException.

I only managed to do a thread dump using jstack -F, and using JDK version in sync with JRE version, JRE version that was used to start the java process.

The only thing I can think of, is that maybe the OS Scheduler does not allow CPU time for the Java process, and if I do a jstack -F I force it to allow it to run?

Any feedback will be appreciated.

Regards,

Cristi

UPDATE-1

Today it happen again. First thing I have checked is the memory used (99.1%) on that box. Afterwards I have performed a jmap -heap, and process resumed without any issues after the heap dump. Heap dump attached.

jmap -heap 7703
Attaching to process ID 7703, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.162-b12

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 536870912 (512.0MB)
   NewSize                  = 89128960 (85.0MB)
   MaxNewSize               = 178782208 (170.5MB)
   OldSize                  = 179306496 (171.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 143130624 (136.5MB)
   used     = 73244792 (69.85167694091797MB)
   free     = 69885832 (66.64832305908203MB)
   51.1733897003062% used
From Space:
   capacity = 17825792 (17.0MB)
   used     = 8176960 (7.79815673828125MB)
   free     = 9648832 (9.20184326171875MB)
   45.871510225183826% used
To Space:
   capacity = 17825792 (17.0MB)
   used     = 0 (0.0MB)
   free     = 17825792 (17.0MB)
   0.0% used
PS Old Generation
   capacity = 243269632 (232.0MB)
   used     = 23534032 (22.443801879882812MB)
   free     = 219735600 (209.5561981201172MB)
   9.674052534432247% used

25964 interned Strings occupying 2759784 bytes.

UPDATE-2

After having GC logs enabled, when the process freezes, this is the tail of the GC log.

2020-09-02T06:51:11.286+0000: 86020.549: Total time for which application 

threads were stopped: 0.0001978 seconds, Stopping threads took: 0.0000666 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Application time: 0.0000610 seconds
2020-09-02T06:51:11.286+0000: 86020.550: Total time for which application threads were stopped: 0.0001793 seconds, Stopping threads took: 0.0000589 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0003371 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001749 seconds, Stopping threads took: 0.0000283 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Application time: 0.0001277 seconds
2020-09-02T06:51:11.287+0000: 86020.550: Total time for which application threads were stopped: 0.0001554 seconds, Stopping threads took: 0.0000364 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Application time: 0.0000400 seconds
2020-09-02T06:51:11.287+0000: 86020.551: Total time for which application threads were stopped: 0.0001082 seconds, Stopping threads took: 0.0000158 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Application time: 0.0010649 seconds
2020-09-02T06:51:11.288+0000: 86020.552: Total time for which application threads were stopped: 0.0001945 seconds, Stopping threads took: 0.0000571 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0001078 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0001852 seconds, Stopping threads took: 0.0000336 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000366 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Total time for which application threads were stopped: 0.0000910 seconds, Stopping threads took: 0.0000180 seconds
2020-09-02T06:51:11.289+0000: 86020.552: Application time: 0.0000412 seconds
2020-09-02T06:51:11.289+0000: 86020.553: Total time for which application threads were
Cristi
  • 180
  • 15
  • It looks like I am able to resume the process if I send a kill -SIGCONT $(PID) as well, sign that maybe kernel have sent a kill -SIGSTOP to the process, maybe due lack of resources. The machines where I see this issues, are where there is a lot of load. – Cristi Aug 11 '20 at 16:20
  • 1
    At first it looked like it is the issue from this question https://stackoverflow.com/questions/34251580/jvm-freeze-under-high-load-in-longevity-tests But it looks like the bug was fixed in the current kernel version I am using root@hostname /]# rpm -q --changelog kernel-2.6.32-754.29.1.el6.x86_64 | grep 'get_futex_key_refs' - [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405] – Cristi Aug 11 '20 at 17:23
  • 2
    We've been struggling against a similar issue, and this was indeed a kernel bug. Upgrading to Linux 4.x helped. – apangin Aug 14 '20 at 22:22
  • 1
    One possibility could be that at that time full GC is happening, you may like to change the GC to CMS, also take thread dump at that time and see if any code creating to many objects or if there is eny memory leak and some objects are not being GCed? – Ravi Yadav Aug 21 '20 at 08:35

1 Answers1

1

If you receive AttachNotSupportedException while getting thread dump with jcmd try to run jcmd under the same user as java process runs. See com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded

Rustem
  • 11
  • 1