Possible causes
The following problems usually show very similar symptoms (your program seems to hang or stops making progress):
- Deadlocks: Circular dependencies on a shared resource (see the minimal sketch after this list)
- Live-locks: The resource lock is handed around, but no progress is made (one step forward, one step back)
- Resource starvation: The worker that should do the work never gets the resources it needs, while all the others seem busy but make no progress
- Heavy swapping: Progress is so slow that the system comes to a halt
- Too many threads (OS overload): The system is completely busy managing its resources, so there is no time left to do real work
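As a concrete illustration of the first item, here is a minimal deadlock sketch (class and lock names are made up for illustration): two threads acquire the same two locks in opposite order and then wait on each other forever.

```java
// Minimal deadlock: t1 holds lockA and wants lockB, t2 holds lockB and wants lockA.
public class DeadlockDemo {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                sleep(100); // give t2 time to grab lockB
                synchronized (lockB) {
                    System.out.println("t1 acquired both locks");
                }
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                sleep(100); // give t1 time to grab lockA
                synchronized (lockA) {
                    System.out.println("t2 acquired both locks");
                }
            }
        });
        t1.start();
        t2.start(); // both threads now wait forever on each other's lock
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```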
Diagnostic Tools
Debugger on developer machine
Deadlocks and similar problems are usually very hard to reproduce and hence cannot easily be analyzed with a debugger. The problem often occurs on the live system but not on a developer machine. The cause might be a different workload, a different data constellation, or even a different OS or different hardware (more CPU cores, a NUMA architecture, etc.).
Remote debugger
You can try to attach a remote debugger to the production system. Only do this when you can risk a complete halt of the production machine (e.g. because of a JVM crash!), and only in a pair debugging session where you discuss each step with a peer.
Logging and visualization
Use extensive logging and visualize the log data (RStudio, Mathematica, etc.). Be aware that the logging itself might change your system: naive logging will affect the performance and timing of your live system. Try asynchronous logging and test the performance impact of your logging before deploying it to the live system. Plan how you want to visualize your log data and what you would expect to see for each of the possible causes described above. Otherwise you might miss the one log statement that would reveal the root cause.
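A minimal sketch of what asynchronous logging can look like with a simple in-process queue (in practice you would more likely use the async facilities of your logging framework); the class name, queue size, and output target are illustrative assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Worker threads only enqueue messages; a single background thread does the slow I/O.
public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);

    public AsyncLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    System.out.println(queue.take()); // replace with file or socket output
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    public void log(String message) {
        // offer() never blocks the caller; messages are dropped when the queue is full,
        // so logging cannot stall the live system
        queue.offer(System.currentTimeMillis() + " " + message);
    }
}
```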
REPL
Query your live system by introducing a "command line" (REPL). By adding a command line to your live system, you can query it and change parts of it to diagnose the root cause. You can use the Clojure REPL, the Scala SBT REPL, or BeanShell, or apply live changes using JRebel together with an external trigger that runs the swapped-in code (web service, scheduler, message queue, etc.). Work in pairs (discuss each command before running it) and remember to protect the REPL against outside access (bind it to localhost or a Unix socket, use a named pipe, double-check your firewall, authorize with a public key, log each access to a dedicated log, etc.)!
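A rough sketch of such a diagnostic command line, bound to localhost only. The port, the command names, and the class name are made up for illustration, and a real system would add the authentication and audit logging mentioned above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// A tiny "command line" that only listens on the loopback interface.
public class DiagnosticShell implements Runnable {
    @Override
    public void run() {
        try (ServerSocket server =
                 new ServerSocket(7777, 1, InetAddress.getLoopbackAddress())) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        switch (line.trim()) {          // dispatch illustrative commands
                            case "threads": out.println(Thread.activeCount() + " live threads"); break;
                            case "quit":    return;
                            default:        out.println("unknown command: " + line);
                        }
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```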
JMX
Usually you can connect to a running Java VM using the Java Management Extensions (JMX). Using JConsole or Java VisualVM, you can inspect the current stack trace of each thread and search for deadlocks. Additionally, you can deploy your own sensors in your application. Using DTrace (when your system supports it, e.g. Solaris, FreeBSD, NetBSD, Mac OS X), you can even monitor parts of the operating system.
You can add your own sensors by providing MBeans or MXBeans (stricter typing, better compatibility).
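A sketch of such a custom sensor as an MXBean; the interface and object names are illustrative assumptions, while the registration against the platform MBeanServer is standard JMX:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

// The management interface must end in "MXBean" to be picked up as an MXBean.
interface WorkerStatsMXBean {
    long getTasksCompleted();
}

public class WorkerStats implements WorkerStatsMXBean {
    private final AtomicLong tasksCompleted = new AtomicLong();

    public void taskCompleted() {
        tasksCompleted.incrementAndGet(); // called by the worker after real progress
    }

    @Override
    public long getTasksCompleted() {
        return tasksCompleted.get(); // visible in JConsole / VisualVM
    }

    public static WorkerStats register() throws Exception {
        WorkerStats stats = new WorkerStats();
        // the ObjectName is illustrative -- pick one matching your application
        ManagementFactory.getPlatformMBeanServer()
            .registerMBean(stats, new ObjectName("com.example:type=WorkerStats"));
        return stats;
    }
}
```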
Diagnostics
Deadlock
JConsole and VisualVM both have a function to find deadlocks and show the threads involved in the deadlock. Together with the function to show the current stack trace of each thread, diagnosing deadlocks becomes a breeze.
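The same deadlock check can also be run programmatically from inside the process (for example via the REPL), using the standard ThreadMXBean API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void printDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null if no deadlock was found
        if (ids == null) {
            System.out.println("no deadlock detected");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.println(info); // prints lock owner and (part of) the stack trace
        }
    }
}
```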
Live locks and resource starvation
Add two counters to each worker: one incremented when the worker acquires the lock, and one incremented when the worker has successfully made real progress. This makes it easy to find out whether your application makes progress, or which workers are just juggling resources around without achieving anything.
You can query the counters using a remote debugger, JMX (if you add a sensor), or the REPL, or add corresponding log messages. Using a REPL or by replacing code in the live system, you can introduce counters, log messages, or JMX sensors when needed.
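A sketch of such progress counters, assuming a simple lock-protected worker (all names are illustrative): a growing gap between the two counters points at a live lock, while two frozen counters point at starvation or a deadlock.

```java
import java.util.concurrent.atomic.AtomicLong;

public class Worker {
    private final AtomicLong locksAcquired = new AtomicLong();
    private final AtomicLong unitsOfProgress = new AtomicLong();
    private final Object sharedLock; // the contended resource

    public Worker(Object sharedLock) {
        this.sharedLock = sharedLock;
    }

    public void doWork() {
        synchronized (sharedLock) {
            locksAcquired.incrementAndGet();       // counts every lock acquisition
            if (processNextItem()) {
                unitsOfProgress.incrementAndGet(); // counts only real progress
            }
        }
    }

    private boolean processNextItem() {
        // placeholder: return true only when real work was done
        return true;
    }
}
```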
Swapping or OS overload
With JMX and DTrace you can analyze parts of your operating system. With a REPL you might be able to get OS and JVM statistics from the running process. With log statements or custom JMX sensors, you can monitor the performance metrics of your application.
It is crucial to measure the performance of your application while it runs fine, so you have baseline values. Otherwise you won't be able to judge whether a measured value is fine or whether it indicates a problem.
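A small sketch of baseline metrics that can be collected from inside the running JVM using the standard java.lang.management API; which metrics are worth recording (and how often) depends on your application:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Record these regularly under normal load (baseline) and while the problem occurs.
public class BaselineSnapshot {
    public static void print() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        System.out.println("load average : " + os.getSystemLoadAverage());
        System.out.println("heap used    : " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("thread count : " + threads.getThreadCount());
        System.out.println("peak threads : " + threads.getPeakThreadCount());
    }
}
```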