4

I run numerical simulations all the time. I can tell if my simulations don't work (i.e., they fail to give acceptable answers), but because I typically run a variable number of these on designated cores running in the background (as I work), looking at clock time tells me less than nothing about how quickly they ran.

I don't want clock time; I want CPU time. None of the articles seems to mention this little aspect. In particular, the recommendation to use a "quiet" machine seems to blur what's being measured.

I don't need a great deal of detail, I just want to know that simulation A runs about 15% faster or slower than simulation B or C, despite the fact that A ran by itself for a while, and then I started B, followed by C. And maybe I played for a little while before retiring, which would run a higher-priority application for part of that time. Don't tell me that ideally I should use a "quiet" machine; my question specifically asks how to do benchmarking without a dedicated machine for this. I also do not wish to kill the efficiency of my applications while measuring how long they take to run; it seems that significant overhead would only be required when a great deal of detail is needed. Am I right?

I want to modify my applications so that when I check whether a batch job succeeds, I can also see how long it took to reach these results in CPU time. Can benchmarking give me the answers I'm looking for? Can I simply use Java 9's benchmarking harness, or do I need something else?

Thomas Adkins

1 Answer

4

You can measure CPU time instead of wall-clock time from outside the JVM easily enough on most OSes, e.g. time java -jar foo.jar on Unix/Linux, or even perf stat java -jar foo.jar on Linux. (time prints "real" wall-clock time along with "user" and "sys": the CPU time your process spent in user-space and in the kernel, summed across all cores.)
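If you'd rather record this from inside the application, so the same check that verifies a batch job succeeded can also report its CPU time, the standard management beans expose it. Here's a minimal sketch; the cast to com.sun.management.OperatingSystemMXBean assumes a HotSpot-based JDK, and runSimulation() is a hypothetical stand-in for your workload:

    import java.lang.management.ManagementFactory;

    public class CpuTimer {
        public static void main(String[] args) {
            // HotSpot-specific subinterface; getProcessCpuTime() returns -1 if unsupported
            com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();

            long start = os.getProcessCpuTime();   // cumulative nanoseconds, all threads
            runSimulation();                       // hypothetical stand-in for your batch job
            long end = os.getProcessCpuTime();

            System.out.printf("CPU time: %.2f s%n", (end - start) / 1e9);
        }

        static void runSimulation() { /* your workload here */ }
    }

For strictly single-threaded code you could instead use ThreadMXBean.getCurrentThreadCpuTime(), which ignores CPU time burned by JIT and GC threads; whether that's a feature or a bug depends on what you want to count.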

The biggest problem with this is that some workloads have more parallelism than others. Consider this simple example. It's unrealistic, but the math works the same for real programs that alternate between more-parallel and less-parallel phases.

  • version A is purely serial for 9 minutes, then keeps 8 cores saturated for 1 minute. Wall-clock time = 10 minutes, CPU time = 9 + 1*8 = 17 minutes

  • version B is serial for 1 minute, then keeps all 8 cores busy for 5 minutes. Wall-clock time = 6 minutes, CPU time = 1 + 5*8 = 41 minutes

Looking only at CPU time, version A seems cheaper (17 vs. 41 CPU-minutes), yet version B finishes 4 minutes sooner. CPU time alone wouldn't tell you which version was stuck on an inherently serial portion of its work. (And this assumes purely CPU-bound work, with no I/O waiting.)

For two similar implementations that are both mostly serial, though, CPU time and wall time could give you a reasonable guess.

But modern JVMs like HotSpot use multi-threaded garbage collection, so even if your own code never starts multiple threads, one version that makes the GC do more work can use more CPU time and still finish sooner in wall-clock terms. That might be rare, though.
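If you suspect the GC is where extra CPU time went, the JVM can at least tell you how much collecting it did. A sketch using the standard GarbageCollectorMXBean; note that getCollectionTime() reports approximate accumulated elapsed time in milliseconds, not CPU time, so with a parallel collector the CPU cost can be several times larger:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // run the workload first, then dump per-collector totals
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, ~%d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }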


Another confounding factor: contention for memory bandwidth and cache space (from whatever else is running on the machine) means the same work takes more CPU time, because your code spends more cycles stalled waiting for memory.

And with Hyper-Threading, or other SMT CPU architectures (like Ryzen's) where one physical core acts as multiple logical cores, having both logical cores active increases total throughput at the cost of lower per-thread performance.

So 1 minute of CPU time on a core whose HT sibling is idle gets more work done than 1 minute on a core where the other logical core is also busy.

With both logical cores active, a modern Skylake or Ryzen might give you anywhere from 50 to 99% of the single-thread performance you'd get with all of a core's execution resources available to one thread, depending entirely on what code each thread is running. (If both threads bottleneck on the latency of FP add and multiply, with long loop-carried dependency chains that out-of-order execution can't see past, e.g. each summing a very large array in order with strict FP, that's the best case for HT: neither thread slows the other down, because peak FP add throughput is 3 to 8x what a single latency-bound dependency chain can sustain.)

But in the worst case, if both tasks slow down a lot from L1d cache misses (each evicting the other's working set), HT can even lose throughput from running both at once on the same core, vs. running one and then the other.

Peter Cordes
  • Okay, here are some simplifying assumptions: (1) my code is single-threaded by design, (2) I'm not worried about whether the order of operations is changed or executed in parallel by the core itself, or about cache spoilage. What does matter is that I've allocated 2 cores to do 3 jobs, and I see quite plainly that the OS has split the CPU time in such a way that one process is given half the available resources and the other two share the rest. So, something like 15%, 7%, 7% might be how it works out. How do I take this into account? What is the "time" utility? – Thomas Adkins Apr 11 '19 at 05:30