I have a multi-threaded application that scales well at first, but on a 16-CPU server the performance levels off once I exceed 5 or 6 hardware threads. I suspect that the bottleneck is one of the synchronized methods. However, I need to be sure it is the guilty method before I dive into the code and try to replace the algorithm with a non-blocking one.
Running Java with the -Xprof argument tells me that, as I expected, the threads are spending most of their time blocked. Is there a way that I can break that down into how much time they spend blocked at a particular method?
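To make the kind of measurement I am after concrete, here is a minimal sketch using the standard java.lang.management API, assuming the JVM supports thread contention monitoring. The class name BlockedTimeSketch, the contended() method, and the 50 ms sleep are hypothetical stand-ins for the real application; this only gives blocked time per thread, not per method.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

public class BlockedTimeSketch {
    private static final Object LOCK = new Object();

    // Hypothetical stand-in for the suspected synchronized method.
    static void contended() {
        synchronized (LOCK) {
            try {
                Thread.sleep(50); // simulate work done while holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Runs `threads` workers through contended() and returns the summed time
    // (in milliseconds) they spent blocked on monitors, or -1 if the JVM
    // does not support contention monitoring.
    static long totalBlockedMillis(int threads) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadContentionMonitoringSupported()) {
            return -1;
        }
        mx.setThreadContentionMonitoringEnabled(true);

        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                contended();
                // Each worker reads its own counters while it is still alive;
                // getThreadInfo returns null for terminated threads.
                ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                if (info != null) {
                    total.addAndGet(info.getBlockedTime());
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Total blocked ms: " + totalBlockedMillis(4));
    }
}
```

ThreadInfo also exposes getThreadState(), getLockName(), and getStackTrace(), so periodically sampling the threads that are in the BLOCKED state might be one way to attribute the time to a particular method, though I have not confirmed how accurate that would be.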