How to benchmark host multi-threaded CPU performance in Java?

Question

I need to create a simple Java app that returns just one number: estimated CPU performance. For example, when I run it on a machine with 4 cores, I should get roughly twice the number I would get on a machine with 2 cores. The app should use 100% CPU for several seconds to measure that. I'm really not worried about accuracy.

I was really surprised that I couldn't find any Java library that already does that. Of course there are tools in other languages, but in my environment only Java is approved.

My current idea is to use classes from SciMark 2.0 in my code and run it from multiple threads, however this tool looks very messy (e.g. class names beginning with lowercase letters) and I need to write custom code to run these threads and combine the results.

Can I do any better to solve that problem?

CPU performance while doing what? It might matter what you are actually trying to measure. The normal way to do this is to measure the total time to complete a task. — markspace, Apr 10 '19 at 15:00
If you're on linux, just read `bogomips` value from `/proc/cpuinfo` — rkosegi, Apr 10 '19 at 15:28
@markspace I don't care. As I said, accuracy doesn't matter to me at all, just rough numbers. Ideally I'm looking for a ready-made solution with whatever assumptions. There will be various tasks to perform, as these are Jenkins agents – Michal Kordas, Apr 10 '19 at 15:35
@rkosegi I cannot use `/proc/cpuinfo` for that, this benchmark must run on demand (VM performance may change without restart) — Michal Kordas, Apr 10 '19 at 15:38
Then I would just benchmark the task at hand, and record its performance. If that performance changes over time, you can investigate the change then. This is better because it measures the time for your actual task, not some arbitrary benchmark. — markspace, Apr 10 '19 at 15:47
@markspace My question is about how to find/write this task. It should use 100% CPU and be fairly simple (ideally library that already does that). All real tasks that will be performed on these virtual machines will use IO operations, but I don't want IO to influence my metric. By such benchmark I want to confirm that CPU performance has or hasn't dropped — Michal Kordas, Apr 10 '19 at 15:54
How hyperthreading-friendly do you want your microbenchmark to be? Do you want a machine with 2 logical cores per physical core to still show near-linear scaling with number of logical cores? Or only with physical cores by having your microbenchmark saturate FP ALU throughput of a whole core even with only one thread, for example? Or do you want some ILP, like enough to keep 3 or 4 but not 8 FP adds in flight at once, so you see a difference between Haswell and Skylake, or between Ryzen and Skylake? — Peter Cordes, Apr 11 '19 at 02:35
What results do you want it to give on Bulldozer-family where 2 (weak-ish) integer cores share an FPU/SIMD unit? (There are equivalent uarch differences for ISAs other than x86, e.g. POWER has 8-way SMT, but x86 CPUs are the most widely available). Also, do you want your benchmark to depend on any system-wide shared resources like memory-bandwidth? (Or in a NUMA system, socket-wide). — Peter Cordes, Apr 11 '19 at 02:37
@PeterCordes Whatever is simpler to code or whatever is supported by an existing library – Michal Kordas, Apr 11 '19 at 11:59

score 3 · Answer 1 · answered Apr 15 '19 at 14:50

This is the simplest piece of code that does what I wanted. It estimates multi-threaded CPU performance by calculating the sum of square roots of consecutive integers. The `iterations` variable can be adjusted to lengthen or shorten the benchmark. On my machine, with the default values, it takes about 7 seconds.

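import static java.util.stream.IntStream.rangeClosed;

class Benchmark {
    public static void main(String[] args) {
        // Work per task; raise or lower this to change how long the benchmark runs.
        final int iterations = 100_000_000;
        long start = System.currentTimeMillis();
        // 50 independent tasks, spread across the common ForkJoinPool's worker threads.
        rangeClosed(1, 50).parallel()
                // Each task sums sqrt(1..iterations); the sum itself is discarded.
                .forEach(i -> rangeClosed(1, iterations).mapToDouble(Math::sqrt).sum());
        // Elapsed wall-clock time in milliseconds (lower means faster).
        System.out.println(System.currentTimeMillis() - start);
    }
}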
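A variant of the same idea, offered as a minimal sketch and not a definitive benchmark: it keeps each partial sum and prints the total, so the work is harder for the JIT to discard, and it times the run with `System.nanoTime()`. The class name `SqrtBenchmark`, the `run` method, and the task and iteration counts are illustrative assumptions.

```java
import java.util.stream.IntStream;

class SqrtBenchmark {
    /** Runs `tasks` independent sqrt-sum loops in parallel and returns the combined sum. */
    static double run(int tasks, int iterationsPerTask) {
        return IntStream.rangeClosed(1, tasks).parallel()
                .mapToDouble(t -> IntStream.rangeClosed(1, iterationsPerTask)
                        .mapToDouble(Math::sqrt)
                        .sum())
                .sum();
    }

    public static void main(String[] args) {
        final int tasks = 50;                       // assumed value, same as the code above
        final int iterationsPerTask = 100_000_000;  // assumed value, same as the code above
        long start = System.nanoTime();
        double checksum = run(tasks, iterationsPerTask);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Printing the checksum makes the result observable to the outside world.
        System.out.println("checksum=" + checksum + " elapsedMs=" + elapsedMs);
    }
}
```

Comparing `elapsedMs` between runs on the same VM and the same JVM version gives the rough "has CPU performance dropped" signal; the checksum should stay identical from run to run.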

answered Apr 15 '19 at 14:50

Michal Kordas


Nice job providing a useful answer to your own question sans rhetoric – tsquared Jan 24 '20 at 22:57

So you're measuring FP sqrt throughput. That's oddly specific and not highly correlated with most FP workloads. E.g. you'll see a very big speedup (like a factor of 2) going from Haswell to Broadwell or Skylake at the same clock speed, while mul/add/FMA throughput hasn't changed. BDW introduced a higher-radix divide/sqrt unit. See the instruction tables at https://agner.org/optimize/; look at `sqrtsd` throughput (8 to 14 vs. 4 to 8), assuming it doesn't vectorize with SIMD. If it does vectorize, Skylake would provide another big speedup for 128 and 256-bit wide vectors. – Peter Cordes Jan 25 '20 at 05:13
@PeterCordes I'm not using it to measure which virtual machine is better or faster. I need it just to do sanity check that my machine performs roughly the same as it did some time ago and still gets similar amount of CPU cycles from the bare metal. – Michal Kordas Jan 25 '20 at 23:09
Ok, yes, it should run the same way every time on the same hardware. (Or faster if a JVM ever figures out how to auto-vectorize, or even optimize away a sum that you don't assign anywhere.) Scaling with number of threads probably won't be helped by hyperthreading; a single hardware thread running this (if it JITs anywhere near efficiently) can probably saturate the sqrt unit. – Peter Cordes Jan 25 '20 at 23:43
@PeterCordes you are right, in that case looks like Docker container with hardcoded Java version would be a solution for the stability of the result – Michal Kordas Jan 27 '20 at 09:51
@MichalKordas: Depends what you want to test! If you want to see if your simple benchmark techniques still work, or if your JVM has improved, use this. Or add printing of the final sum so it can't optimize away (as easily), but could still speed up with SIMD so you maybe get a pleasant surprise sometime. – Peter Cordes Jan 27 '20 at 10:06

Stephen C · Answer 2 · 2019-04-10T15:24:36.997

2

If I understand you correctly, your goal is to measure system performance rather than application performance.

Here's the problem. System performance cannot be reduced to a single meaningful number. In reality, system performance ... even CPU performance is multi-dimensional.

For example, an application that is memory intensive will perform differently on different machines depending on the CPU chip's memory cache size and design ... and the memory speed. But if the application is compute intensive, then the performance will depend more on the clock rate and core count.

Then there are issues like the effects of NUMA cells and thread pinning when the core count is high and/or you have multiple CPU chips.

These and similar issues are why benchmarks that attempt to measure raw CPU performance independent of the application have largely fallen out of favor. (MIPS originally meant million (hardware) instructions per second. It is now often referred to as mythical instructions per second ... alluding to the bogosity of the measure as a predictor of real application performance.)

edited Apr 10 '19 at 15:24

answered Apr 10 '19 at 15:16

Stephen C


Fully agreed. As I highlighted I need to have very rough number. I don't care about details. I just need to detect that for some reason performance of this particular virtual machine has dropped over time (e.g. because physical server was over-allocated). And I care only about order of magnitude changes, e.g. this VM was able to calculate 1M digits of PI yesterday in 1 minute but today it took 10 minutes, so something is definitely wrong. – Michal Kordas Apr 10 '19 at 15:44

Well ... if you are looking for a random meaningless indicator, calculate the first D digits of Pi N times in N parallel threads. And measure clock time or cpu time using one of these: https://stackoverflow.com/a/7467299/139985 – Stephen C Apr 11 '19 at 04:32
But if your goal is to measure the performance of a VM whose performance you suspect is dropping due to over-commit, then a benchmark that measures CPU performance of single or multiple threads is not enough. Why? Because you also need to consider RAM over-commit and I/O or device saturation. For a typical Java application, these things can have a much more severe impact on performance than simple CPU <-> VCPU over-commit. – Stephen C May 06 '23 at 09:32
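The suggestion above (the same fixed workload on N parallel threads, measured by clock time or CPU time) could be sketched roughly as follows. This is an illustration under assumptions: a partial Leibniz series stands in for "digits of Pi", and the names `FixedWorkBenchmark`, `leibnizPi`, and `run` are made up for the example. Per-thread CPU time comes from the standard `ThreadMXBean`, where the JVM supports it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedWorkBenchmark {
    /** Stand-in workload: 4 * sum over k of (-1)^k / (2k + 1), which converges slowly to pi. */
    static double leibnizPi(int terms) {
        double sum = 0.0;
        for (int k = 0; k < terms; k++) {
            sum += ((k & 1) == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);
        }
        return 4.0 * sum;
    }

    /** Runs the workload once per thread; returns {wall-clock nanos, summed CPU nanos}. */
    static long[] run(int threads, int terms) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isCurrentThreadCpuTimeSupported();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Long> task = () -> {
                long cpuStart = cpuSupported ? mx.getCurrentThreadCpuTime() : 0L;
                double pi = leibnizPi(terms);
                // Use the result so the loop cannot be treated as dead code.
                if (Double.isNaN(pi)) throw new IllegalStateException("unexpected NaN");
                // Reports 0 when the JVM cannot measure per-thread CPU time.
                return cpuSupported ? mx.getCurrentThreadCpuTime() - cpuStart : 0L;
            };
            List<Future<Long>> futures = new ArrayList<>();
            long wallStart = System.nanoTime();
            for (int i = 0; i < threads; i++) futures.add(pool.submit(task));
            long cpuTotal = 0;
            for (Future<Long> f : futures) cpuTotal += f.get();
            return new long[] { System.nanoTime() - wallStart, cpuTotal };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long[] r = run(threads, 200_000_000); // term count is an arbitrary assumption
        System.out.println("wallMs=" + r[0] / 1_000_000 + " cpuMs=" + r[1] / 1_000_000);
    }
}
```

On a VM that is being starved of CPU by its host, wall-clock time would be expected to grow much more than the CPU time the threads actually received, which is why the sketch reports both.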

Gonzalo Matheu · Answer 3 · 2019-04-10T15:15:49.473

0

Java Microbenchmark Harness (JMH) is a toolkit for implementing benchmarks of Java code.

It measures Throughput or Average Time; you could use that to estimate cpu cycles.

Basically, you need to annotate with @Benchmark the method you want to benchmark. This method

There are a few JMH usage samples in their repository.

It is always recommended to leave the computer alone while it runs the benchmarks, and you should close all other applications (if possible). If your computer is running other applications, they may take CPU time away from the benchmark and produce incorrect (lower) performance numbers.

If you want to dig further into CPU performance (cycles, cache usage, instructions, etc.), you will probably need to use Linux perf.

edited Apr 10 '19 at 15:15

answered Apr 10 '19 at 15:02

Gonzalo Matheu


I don't need to measure performance of the code. I'm looking for a Java library (or idea how to write such library) that will trigger some CPU-heavy tasks that exercise all available threads for configurable amount of time and as a result I will get number that roughly says something about current CPU performance of this VM. – Michal Kordas Apr 10 '19 at 15:47
JMH's [Blackhole](http://hg.openjdk.java.net/code-tools/jmh/file/cde312963a3d/jmh-core/src/main/java/org/openjdk/jmh/logic/BlackHole.java#l413) class has *consumeCPU* method that just consumes CPUs avoiding JIT optimizations – Gonzalo Matheu Apr 10 '19 at 17:15
OK, but this is still single threaded. I'm looking rather for `consumeAllCpus(long tokens)` method, otherwise it's perfect. – Michal Kordas Apr 10 '19 at 19:14
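For illustration, a multi-threaded, time-boxed helper in the spirit of the `consumeAllCpus` wished for above might look like the sketch below. It is plain Java and not part of JMH; the class name `CpuBurner`, the millisecond parameter (instead of tokens), and the batch size of 10,000 are all assumptions. It spins one thread per available processor for a configurable duration and returns how many work batches were completed, which can serve as a rough score.

```java
import java.util.concurrent.atomic.LongAdder;

class CpuBurner {
    /**
     * Hypothetical helper: keeps every available processor busy for about
     * `millis` milliseconds and returns the total number of batches completed.
     */
    static long consumeAllCpus(long millis) {
        int threads = Runtime.getRuntime().availableProcessors();
        LongAdder batches = new LongAdder();
        long deadline = System.nanoTime() + millis * 1_000_000L;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                double acc = 0.0;
                long done = 0;
                // Overflow-safe deadline check, as recommended for System.nanoTime().
                while (deadline - System.nanoTime() > 0) {
                    for (int k = 1; k <= 10_000; k++) acc += Math.sqrt(k);
                    done++;
                }
                // Reading acc here keeps the inner loop's result in use.
                if (acc >= 0) batches.add(done);
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return batches.sum();
    }

    public static void main(String[] args) {
        long score = consumeAllCpus(5_000); // five seconds is an arbitrary choice
        System.out.println("batches completed: " + score);
    }
}
```

Because the duration is fixed and the amount of work varies, a higher number means more CPU was available; with twice the cores, the score should come out roughly twice as high, subject to the hyperthreading caveats raised in the comments.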

score 0 · Answer 4 · answered Jan 24 '20 at 23:04

Michal, thanks for your answer. I borrowed it and added some threading to help me diagnose a virtual CPU performance issue on a client's AIX machine.

import static java.util.stream.IntStream.rangeClosed;

public class Main {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: benchmark [million iterations] [maxThreads]");
            return;
        }

        final int MILLION = 1_000_000;
        final int iterations = Integer.parseInt(args[0]);
        final int maxThreads = Integer.parseInt(args[1]);

        // Inclusive upper bound, so the run with maxThreads tasks is measured too.
        for (int threads = 1; threads <= maxThreads; threads++) {
            long start = System.currentTimeMillis();
            // Multiply as long and fail loudly instead of silently overflowing int.
            int count = Math.toIntExact((long) iterations * MILLION / threads);
            // Parallel streams run on the common ForkJoinPool, so real parallelism
            // is capped by that pool's size, not by `threads`.
            rangeClosed(1, threads).parallel()
                .forEach(i -> rangeClosed(1, count).mapToDouble(Math::sqrt).sum());

            System.out.println(String.format("Benchmark of %d M iterations on %d thread(s): %d ms", iterations, threads, System.currentTimeMillis() - start));
        }

    }

}
