
A few hours ago I answered another Stack Overflow question, and it gave a very surprising result. The answer can be found here. The answer was (and still is) partially wrong; however, in this question I want to focus on the byte addition itself.

Strictly speaking, it is actually byte-to-long addition.
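
To make that concrete, here is a minimal illustration (my own, not part of the benchmark below): each byte is sign-extended to long before it is added, so a byte of 0xFF contributes -1 to the sum, not 255.

public class ByteWideningDemo {
    public static void main(String[] args) {
        byte b = (byte) 0xFF;    // stored as -1 in two's complement
        long sum = 0L;
        sum += b;                // b is sign-extended to long before the addition
        System.out.println(sum); // prints -1, not 255
    }
}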

This is the benchmark code I have been using:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

public class ByteAdditionBenchmark {
    private void start() {
        int[] sizes = {
            700_000,
            1_000,
            10_000,
            25_000,
            50_000,
            100_000,
            200_000,
            300_000,
            400_000,
            500_000,
            600_000,
            700_000,
        };

        for (int size : sizes) {
            List<byte[]> arrays = createByteArrays(size);
            //Warmup
            arrays.forEach(this::byteArrayCheck);
            benchmark(arrays, this::byteArrayCheck, "byteArrayCheck");
        }
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + " ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck(final byte[] array) {
        long sum = 0L;
        for (byte b : array) {
            sum += b;
        }
        return (sum == 0);
    }

    public static void main(String[] args) {
        new ByteAdditionBenchmark().start();
    }
}

And these are the results I have been getting:

Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 50.26538857142857 ns
Benchmark: byteArrayCheck / iterations: 1000 / time per iteration: 20.12 ns
Benchmark: byteArrayCheck / iterations: 10000 / time per iteration: 9.1289 ns
Benchmark: byteArrayCheck / iterations: 25000 / time per iteration: 10.02972 ns
Benchmark: byteArrayCheck / iterations: 50000 / time per iteration: 9.04478 ns
Benchmark: byteArrayCheck / iterations: 100000 / time per iteration: 18.44992 ns
Benchmark: byteArrayCheck / iterations: 200000 / time per iteration: 15.48304 ns
Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 15.806353333333334 ns
Benchmark: byteArrayCheck / iterations: 400000 / time per iteration: 16.923685 ns
Benchmark: byteArrayCheck / iterations: 500000 / time per iteration: 16.131066 ns
Benchmark: byteArrayCheck / iterations: 600000 / time per iteration: 16.435461666666665 ns
Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 17.107615714285714 ns

To my knowledge, the JVM is already fully warmed up after the first 700,000 iterations, before it starts to spit out benchmarking data.

How can it then be that, despite the warmup, the performance is still unpredictable? Almost directly after the warmup, byte addition gets blazingly fast, but after that it seems to converge back to a nominal 16 ns per addition.

The tests were run on a PC with a stock-clocked Intel i7 3770 and 16 GB of RAM, hence I can't go much beyond 700,000 iterations. It is running Windows 8.1 64-bit, if that matters.

It turns out that the JIT was optimizing everything away, as per raphw's suggestion.

Therefore I replaced the benchmark method with the following:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the computation cannot be optimized away, and the test results show it (result print omitted for clarity):

Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 1658.2627914285715 ns
Benchmark: byteArrayCheck / iterations: 1000 / time per iteration: 1241.706 ns
Benchmark: byteArrayCheck / iterations: 10000 / time per iteration: 1215.941 ns
Benchmark: byteArrayCheck / iterations: 25000 / time per iteration: 1332.94656 ns
Benchmark: byteArrayCheck / iterations: 50000 / time per iteration: 1456.0361 ns
Benchmark: byteArrayCheck / iterations: 100000 / time per iteration: 1753.26777 ns
Benchmark: byteArrayCheck / iterations: 200000 / time per iteration: 1756.93283 ns
Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 1762.9992266666666 ns
Benchmark: byteArrayCheck / iterations: 400000 / time per iteration: 1806.854815 ns
Benchmark: byteArrayCheck / iterations: 500000 / time per iteration: 1784.09091 ns
Benchmark: byteArrayCheck / iterations: 600000 / time per iteration: 1804.6096366666666 ns
Benchmark: byteArrayCheck / iterations: 700000 / time per iteration: 1811.0597585714286 ns

I'd say that these results look a lot more convincing with respect to the computation time. However, my question still stands. With repeated tests at random times, the same pattern remains: benchmarks with a low number of iterations are faster than those with more iterations, though they do seem to stabilize at 100,000 iterations or somewhat lower.

What is the explanation?

skiwi
    You are creating fresh arrays for every benchmark round. Therefore the garbage collector may have an impact on the result. – Robert May 23 '14 at 12:43
  • @Robert Unfortunately I can't correct it that well, as I have no space to store everything in memory. And leaving out some data may influence the test. – skiwi May 23 '14 at 12:54
  • @skiwi What happens if you make some calls to `System.gc()`? I know it won't guarantee gc, but from one of Jon Skeet's observations (can't find the question...) it happens pretty much all the time anyways. You should be able to set a JVM flag too that will print when gc occurs – awksp May 23 '14 at 12:58
  • Use a proper benchmarking framework which prevents loads of uncertain things, or this simply isn't valid or serious. Start with investigating JMH or Caliper. – TC1 May 23 '14 at 19:27
  • You have a 64kbyte L1 cache, by the looks of things, which is responsible for the difference. – pjc50 May 23 '14 at 20:17
  • I can only repeat myself: write a harnessed benchmark, at least for future attempts; I have done this for you this time. I recommend you use JMH. Alternatively, use Caliper if you must, but I like JMH better. – Rafael Winterhalter May 24 '14 at 13:12
  • possible duplicate of [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – Raedwald May 27 '14 at 07:20
  • To the editors and mindless approve-voters: Do *not* mess up the indentation of the `sizes` array please. – skiwi Jun 03 '14 at 13:35

5 Answers

19

The reason for your result is that you do not actually know what you are measuring. Java's just-in-time compiler is most certainly having a look at your code and it might just happen that you are measuring nothing.

The compiler is smart enough to figure out that your List<byte[]> is not actually used for anything. Therefore, it will eventually remove all the related code from your running application. Thus, your benchmark is most likely measuring an increasingly empty application.

The answer to all such questions is always: it is not worth having a discussion before we have actually looked at a valid benchmark. Benchmarking harnesses such as JMH (which I can recommend) know a concept called a black hole. Black holes are meant to trick the just-in-time compiler into thinking that a computed value is actually used for something even though it is not. With such black holes, code that would otherwise be erased as a no-op remains in place.
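
As an illustration, here is a minimal sketch of how such a black hole is used. Note that it uses the Blackhole class and @Benchmark annotation from current JMH versions (org.openjdk.jmh.infra.Blackhole); the full benchmark further down uses the older @GenerateMicroBenchmark API that was current at the time.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class BlackholeSketch {

    @Benchmark
    public void sumWithBlackhole(Blackhole blackhole) {
        long sum = 0L;
        for (int i = 0; i < 4096; i++) {
            sum += (byte) i;
        }
        // The black hole "uses" the value, so the JIT cannot drop the loop as dead code.
        blackhole.consume(sum);
    }
}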

Another typical problem of home-grown benchmarks is optimized loops. Again, the just-in-time compiler will notice that the loop results in the same computation on every iteration and will therefore remove the loop altogether. With a (quality) benchmarking harness, you only suggest a number of loops to run instead of hard-coding them. This way, the harness can take care of tricking the compiler.
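
For example, warmup and measurement rounds are declared rather than hard-coded (a sketch only, using annotation names from current JMH; the exact counts here are illustrative):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Warmup;

public class HarnessControlledSketch {

    @Benchmark
    @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)       // the harness runs the warmup
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) // and the measured rounds
    @Fork(1)                                                             // in a freshly forked JVM
    public long sumBytes() {
        long sum = 0L;
        for (int i = 0; i < 4096; i++) {
            sum += (byte) i;
        }
        return sum; // returning the value keeps the loop from being eliminated
    }
}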

Write a benchmark with JMH and you will see that your measured times are quite different.

Concerning your update: I can only repeat myself. Never trust an unharnessed benchmark! An easy way to find out what the JVM is doing to your code is to run JITwatch. The main concern with your benchmark is that it ignores the JVM's profiling. A profile is an attempt by the JVM to remember properties of your code, on which it then bases its optimizations. For your benchmark, you mix the profiles of different runs together. The JVM then has to renew its current profile and recompile the byte code on the fly, which costs time.

In order to avoid this issue, a harness like JMH lets you fork a new JVM process for each benchmark. Here is what I am measuring with a harnessed benchmark:

Benchmark                    Mode   Samples         Mean   Mean error    Units
o.s.MyBenchmark.test100k     avgt        20     1922.671       29.155    ns/op
o.s.MyBenchmark.test10k      avgt        20     1911.152       13.217    ns/op
o.s.MyBenchmark.test1k       avgt        20     1857.205        3.086    ns/op
o.s.MyBenchmark.test200k     avgt        20     1905.360       18.102    ns/op
o.s.MyBenchmark.test25k      avgt        20     1832.663      102.562    ns/op
o.s.MyBenchmark.test50k      avgt        20     1907.488       18.043    ns/op

And here is the source code of the benchmark, based on the aforementioned JMH:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class MyBenchmark {

    private List<byte[]> input1k, input10k, input25k, input50k, input100k, input200k;

    @Setup
    public void setUp() {
        input1k = createByteArray(1_000);
        input10k = createByteArray(10_000);
        input25k = createByteArray(25_000);
        input50k = createByteArray(50_000);
        input100k = createByteArray(100_000);
        input200k = createByteArray(200_000);
    }

    private static List<byte[]> createByteArray(int length) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(1_000)
    public boolean test1k() {
        return runBenchmark(input1k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(10_000)
    public boolean test10k() {
        return runBenchmark(input10k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(25_000)
    public boolean test25k() {
        return runBenchmark(input25k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(50_000)
    public boolean test50k() {
        return runBenchmark(input50k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(100_000)
    public boolean test100k() {
        return runBenchmark(input100k, this::byteArrayCheck);
    }

    @GenerateMicroBenchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(200_000)
    public boolean test200k() {
        return runBenchmark(input200k, this::byteArrayCheck);
    }

    private static boolean runBenchmark(List<byte[]> arrays, Predicate<byte[]> method) {
        boolean someUnrelatedResult = false;
        for (byte[] array : arrays) {
            someUnrelatedResult |= method.test(array);
        }
        return someUnrelatedResult;
    }

    private boolean byteArrayCheck(final byte[] array) {
        long sum = 0L;
        for (byte b : array) {
            sum += b;
        }
        return (sum == 0);
    }

    public static void main(String[] args) throws RunnerException {
        new Runner(new OptionsBuilder()
                .include(".*" + MyBenchmark.class.getSimpleName() + ".*")
                .forks(1)
                .build()).run();
    }
}
Rafael Winterhalter
  • I don't believe you are entirely correct. The `List` is in fact being used for something -- calculating `sum`, which is then printed, so the first optimization you describe can't occur. – awksp May 23 '14 at 13:08
  • 1
    This is why benchmarking is often tricky. Unless you're writing them in machine-specific, hardware-near code (like assembly), you really have no idea what you're benchmarking. Try running your benchmarking code through a profiler, you'll see. – Mikkel Løkke May 23 '14 at 13:08
  • @user3580294: where in the entire code is `sum` printed? Actually the `sum` is not used at all. – Holger May 23 '14 at 13:13
  • @user3580294 No, it is never really used. All of your call sites are monomorphic. This makes it super easy for the JIT compiler to figure out that the generated `List` never leaves the `start` method. It is never persisted, printed, serialized or anything of the like. It will do a simple escape analysis and drop the entire list creation. A black hole will emulate this property for you. – Rafael Winterhalter May 23 '14 at 13:13
  • Oops, it's not printed. It *is* used for a boolean check, but that doesn't appear to be used. My mistake. – awksp May 23 '14 at 13:14
  • @user3580294 This is why you should write a harnessed benchmark. There are plenty of examples deployed with the JMH. But do not forget that a harness doesn't guarantee you a specific measurement either. You need a tool like *JITwatch* if you want to know for sure what happens under the covers. – Rafael Winterhalter May 23 '14 at 13:16
  • Yeah, but even a simple change from `Consumer` to `Predicate` and a conditional `println` based on the result (it will never get executed as the arrays aren’t empty) will change the result to plausible values. – Holger May 23 '14 at 13:20
  • You should add some code to verify what you have said, as it *seems* to be correct, but it wouldn't be correct for me to change the question now. (Hint: you should change `Consumer method` to `Predicate method`.) If it's true, then it is quite nasty... I haven't used a proper test harness because I wasn't able to set one up easily, hence I decided to roll my own. Still, I do not see the real issue with using my own, unless the proven one would have warned me that I was calculating hot air. – skiwi May 23 '14 at 13:20
  • Hey, I wasn't the one writing the original test. But looking at the JIT compilation statements, it appears that the vast majority of JIT compilations occur during the warmup phase, and most of the changes after (<=10 total) are in core Java libraries... – awksp May 23 '14 at 13:21
  • However if that was "the issue", then my question still remains valid as the only thing that changed is that all numbers are roughly a factor 100 higher... With still the same pattern emerging. – skiwi May 23 '14 at 13:22
  • @skiwi: post your code. Btw. on my machine even executing a no-op lambda takes about 10 to 20ns, just like your methods. It’s a good indicator that your methods were optimized away. – Holger May 23 '14 at 13:25
  • @user3580294 True, sorry. Multi-tab browsing of SO got me mixed up. In general, I can only recommend this article on why one really, really should always use a harness, even for simple benchmarks: http://www.ibm.com/developerworks/library/j-jtp02225/ – Rafael Winterhalter May 23 '14 at 13:30
  • @raphw It's fine, I've done that myself enough times. I'll be sure to give that a read! Thanks! – awksp May 23 '14 at 13:41
2

For 1,000 iterations, you are mostly measuring the overhead of method calls, timing, etc., which outweighs the time to do the actual work. Beyond 50,000 iterations, your processor runs out of L1 cache and slows down. Depending on the cache size of your processor, you'll probably have another slowdown at a few million iterations, when the data doesn't fit into the L2 cache anymore.

Your processor has an 8 MB cache, so at that number of iterations you should get the next slowdown. You can change the test to add, say, only every fourth byte, and you'll see that the time doesn't improve, because it's not the operations but the memory bandwidth that costs the time.
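
A rough sketch of such a variant (my own naming, not from the original code): it performs only a quarter of the additions but still walks every cache line of the array, so if memory bandwidth dominates, the measured time should barely change.

// Hypothetical stride-4 variant of the question's byteArrayCheck.
// It does a quarter of the additions but still touches the whole array.
private boolean byteArrayCheckStride4(final byte[] array) {
    long sum = 0L;
    for (int i = 0; i < array.length; i += 4) {
        sum += array[i];
    }
    return (sum == 0);
}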

gnasher729
2

A simple change to your benchmark method makes a huge difference:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    arrays.forEach(a -> { if(method.test(a)) System.out.println(); });
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

Here, the result is actually used from the JVM's point of view. While I got roughly the same values as you for your original code on my machine, after the change I got:

Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 1447.9460033333332ns
Benchmark: byteArrayCheck / iterations: 1000 / time per iteration: 3801.986ns
Benchmark: byteArrayCheck / iterations: 10000 / time per iteration: 3319.9504ns
Benchmark: byteArrayCheck / iterations: 25000 / time per iteration: 1929.62352ns
Benchmark: byteArrayCheck / iterations: 50000 / time per iteration: 1943.07152ns
Benchmark: byteArrayCheck / iterations: 100000 / time per iteration: 1928.07745ns
Benchmark: byteArrayCheck / iterations: 200000 / time per iteration: 1915.344575ns
Benchmark: byteArrayCheck / iterations: 300000 / time per iteration: 1918.1994833333333ns
Benchmark: byteArrayCheck / iterations: 400000 / time per iteration: 1913.248085ns

(I skipped the higher numbers because of insufficient RAM)

It shows that there's a fixed overhead that becomes negligible with bigger numbers, but also that fluctuations in the range of 10 to 20 nanoseconds are irrelevant.


I want to emphasize that this is still not a reliable benchmark (if there ever can be one). But it is good enough to indicate that raphw’s answer has a valid point.

Holger
  • I have just updated my question to incorporate this fix. However, for me, benchmarks with a lower number of iterations remain significantly faster than the ones with a higher number of iterations. Interestingly enough, in your case the timings are even more unpredictable. – skiwi May 23 '14 at 15:30
  • @skiwi: as you might see from my printout, I had a smaller warm-up (because I have less RAM), and you have changed your method to use a `for`-loop rather than `forEach(lambda)`. The difference for smaller arrays seems to be a side effect of a pause I added for gc (I guess the thread ran on a different core after the pause and had to refill caches). After changing the code to be closer to yours, I get almost the same values as you. – Holger May 23 '14 at 15:47
  • The main problem that remains is the JVM's profiling. With each new benchmark, you mix profiles for different sizes. To overcome this, you need to fork a different JVM process for any run. I wrote a benchmark to settle this. See my updated answer. – Rafael Winterhalter May 24 '14 at 13:40
1

This could be many things. Among them: Windows and clocks.

Windows: Even if you are running nothing else, the system might decide that it needs the core your code is running on to polish up some graphics or dust off some long-forgotten files.

Clocks: It's called System.nanoTime(), but this doesn't mean that the values change that fast. A while back I did a test on 'System.currentTimeMillis()' and the value only changed every 10 ms.
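
A small sketch (my own, not from the original post) that estimates how coarse each clock is by counting how often its reported value actually changes over many calls:

public class ClockGranularity {
    public static void main(String[] args) {
        long lastMillis = System.currentTimeMillis();
        long lastNanos = System.nanoTime();
        int millisChanges = 0;
        int nanoChanges = 0;
        for (int i = 0; i < 10_000_000; i++) {
            long m = System.currentTimeMillis();
            if (m != lastMillis) { millisChanges++; lastMillis = m; }
            long n = System.nanoTime();
            if (n != lastNanos) { nanoChanges++; lastNanos = n; }
        }
        // A low change count relative to the elapsed time indicates a coarse clock.
        System.out.println("currentTimeMillis changed " + millisChanges + " times");
        System.out.println("nanoTime changed " + nanoChanges + " times");
    }
}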

Dawnkeeper
  • The pattern of the results (not the exact same numbers obviously) is consistent over lots of randomly timed benchmarks. – skiwi May 23 '14 at 12:31
  • Also, the fact that OP's running a multicore processor and the code doesn't appear to be threaded means that it's highly unlikely the OS is bumping the timing thread off to do random work. – awksp May 23 '14 at 12:44
  • If you really get the same results every time that's strange. Depending on what happens you should get randomly higher results on every run and the results for higher sample numbers should be more stable as the longer running time eliminates the error from the timer resolution. – Dawnkeeper May 23 '14 at 12:54
  • Also the low sample number results should tend to bump into the smaller execution times per sample due to the timer graininess. – Dawnkeeper May 23 '14 at 13:00
  • +1 for the description of the scatter brained Windows kernel. "dust some long forgotten files" was funny quote of the day. Yes it's been that kind of day for me. – Floris May 23 '14 at 19:05
  • 1
    You just found out the reason yourself why you should never use `System.currentTimeMillis` for benchmarking (well one of them, there are more). `nanoTime` uses the high performance counters (completely different API) and is way more accurate than 10ms. – Voo May 23 '14 at 19:51
1

Like a lot of things in computer science, it depends. Working with a Windows OS, as Dawnkeeper pointed out, may be part of the issue.

The reality is that all processes on a computer share the CPU (even multi-core CPUs). So your process is just one of dozens, perhaps hundreds of processes that all need time on the CPU. Your process probably has a higher priority, so it will spend more time on the CPU than say, the process that cleans files in the background (again, pointed out by Dawnkeeper).

Something that sometimes compounds the CPU sharing is processes that get involved in I/O. Whenever something needs to print to the screen or get something from disk, it is SLOW. Every time a process is kicked off the CPU it does one of two things. If it's a 'nice' process, it will save where it is and shut everything down and step off. If the process is involved in I/O, this will take some time. The other option is that the process is 'important' and will continue in its task until it reaches a good point to stop. This is not unlike when someone says "Hey, I need to talk to you" and you respond "This YouTube video will be over in 20 seconds, hold on".

I hope that helps. The JVM is just another process in the computer's eyes.

EDIT: clarification questions -- how are you handling those print statements? Are they being printed to the screen? Written to a file? Stored in memory until execution is complete and THEN written to a file?

EDIT 2: this may help you change the priority.
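
For the priority experiment suggested in the comments below, here is a minimal sketch (hypothetical; Java thread priority is only a hint to the OS scheduler, and process priority itself is set outside the JVM, e.g. in Task Manager):

public class PrioritySketch {
    public static void main(String[] args) throws InterruptedException {
        Thread benchmarkThread = new Thread(() -> {
            // The benchmark body would run here.
            System.out.println("Running at priority " + Thread.currentThread().getPriority());
        });
        // MAX_PRIORITY is only a request; the scheduler may ignore it.
        benchmarkThread.setPriority(Thread.MAX_PRIORITY);
        benchmarkThread.start();
        benchmarkThread.join();
    }
}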

Kyte
  • The important parts of this program involves "slow" I/O in no way, shape, or form, though. – awksp May 23 '14 at 12:50
  • If you look at the code, you can see that printing is done after each round of the benchmark is complete, outside of any timing. – awksp May 23 '14 at 12:51
  • In that case I'd say the discrepancies have more to do with how the CPU scheduler works. You can do things to make your JVM thread have a higher priority. It'd be interesting to see what the times are between manually setting the thread priority high and low. – Kyte May 23 '14 at 12:52
  • Try maximizing the priority of your process and see if you get the same results. – Daniel May 23 '14 at 12:52
  • http://www.sevenforums.com/tutorials/83361-priority-level-set-applications-processes.html This may help you set the priority. – Kyte May 23 '14 at 12:53
  • To be honest, I highly doubt that the CPU scheduler on a multicore processor would make enough of a difference to affect things, especially if you aren't doing anything else on the computer at the same time. – awksp May 23 '14 at 13:04
  • Also, ran a quick test. No significant difference between highest and lowest priority on my machine. – awksp May 23 '14 at 13:05
  • Ok, then I am out of ideas my friend. My desktop at home is nearly identical to yours, nice build! – Kyte May 23 '14 at 13:06
  • I'm not OP, by the way. Just happen to be pretty interested in this problem, since it's related to another question he answered that I'm following – awksp May 23 '14 at 13:12
  • that's what I get for following 4 questions at once – Kyte May 23 '14 at 13:19
  • @Kyte All the issues you mention are actually pretty minor - on a "normal" recent computer, the CPU consumption is pretty low most of the time... The OP has just fallen into one of the many pitfalls of micro-benchmarking... – assylias Jun 06 '14 at 05:08