Disclaimer: I read this article by Alexey Shipilev and understood that nanobenchmarks are a kind of evil. But I still want to experiment and understand it by myself.

I'm trying to measure array creation vs boxing of a byte. Here is my benchmark:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@Measurement(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
public class MyBenchmark {

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public void arrayBenchmark(Blackhole bh) {
        byte[] b = new byte[1];
        b[0] = 20;
        bh.consume(b);
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public void bonxingBenchmark(Blackhole bh) {
        bh.consume(new Byte((byte) 20));
    }
}
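
For reference, the benchmarks can be launched through the JMH Runner API roughly like this (just a sketch; the BenchmarkRunner class name is arbitrary, and the addProfiler("gc") call corresponds to the -prof gc runs shown further down):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName()) // run both benchmarks above
                .addProfiler("gc")                          // same as passing -prof gc
                .build();
        new Runner(opts).run();
    }
}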

I ran this benchmark several times, and for some reason boxing turns out to be about 1.5 times faster than creating the array and putting an element into it.

So I decided to run these benchmarks with -prof gc. Here is the result:

MyBenchmark.arrayBenchmark                                     avgt    5     7.751 ±    0.537   ns/op
MyBenchmark.arrayBenchmark:·gc.alloc.rate                      avgt    5  1966.743 ±  143.624  MB/sec
MyBenchmark.arrayBenchmark:·gc.alloc.rate.norm                 avgt    5    24.000 ±    0.001    B/op
MyBenchmark.arrayBenchmark:·gc.churn.PS_Eden_Space             avgt    5  1966.231 ±  326.378  MB/sec
MyBenchmark.arrayBenchmark:·gc.churn.PS_Eden_Space.norm        avgt    5    23.999 ±    4.148    B/op
MyBenchmark.arrayBenchmark:·gc.churn.PS_Survivor_Space         avgt    5     0.042 ±    0.113  MB/sec
MyBenchmark.arrayBenchmark:·gc.churn.PS_Survivor_Space.norm    avgt    5     0.001 ±    0.001    B/op
MyBenchmark.arrayBenchmark:·gc.count                           avgt    5    37.000             counts
MyBenchmark.arrayBenchmark:·gc.time                            avgt    5    48.000                 ms

MyBenchmark.bonxingBenchmark                                   avgt    5     6.123 ±    1.306   ns/op
MyBenchmark.bonxingBenchmark:·gc.alloc.rate                    avgt    5  1664.504 ±  370.508  MB/sec
MyBenchmark.bonxingBenchmark:·gc.alloc.rate.norm               avgt    5    16.000 ±    0.001    B/op
MyBenchmark.bonxingBenchmark:·gc.churn.PS_Eden_Space           avgt    5  1644.547 ± 1004.476  MB/sec
MyBenchmark.bonxingBenchmark:·gc.churn.PS_Eden_Space.norm      avgt    5    15.769 ±    7.495    B/op
MyBenchmark.bonxingBenchmark:·gc.churn.PS_Survivor_Space       avgt    5     0.037 ±    0.067  MB/sec
MyBenchmark.bonxingBenchmark:·gc.churn.PS_Survivor_Space.norm  avgt    5    ≈ 10⁻³               B/op
MyBenchmark.bonxingBenchmark:·gc.count                         avgt    5    23.000             counts
MyBenchmark.bonxingBenchmark:·gc.time                          avgt    5    37.000                 ms

As we can see, the GC is more heavily loaded in the arrayBenchmark case: the allocation rate is 1966 MB/sec vs 1664 MB/sec, and gc.count and gc.time also differ. I think that's the cause, but I'm not sure.

For now I don't quite understand this behaviour. I thought array allocation in my case just meant allocating 1 byte somewhere; to me it looks pretty much the same as boxing, but apparently it's different.

Can you help me to understand it?

And what's most important... Can I trust this benchmark?

St.Antario
  • offtopic: but your 6666 reputation looks more than evil to me too... not even talking about the avatar pic.... – ΦXocę 웃 Пepeúpa ツ Aug 09 '17 at 13:22
  • @ΦXocę웃Пepeúpaツ Offtopic: give my +1 so my reputation is OK again :) – St.Antario Aug 09 '17 at 13:22
  • 1
    So, I could be wrong about this, but I believe what you might be seeing is the extraneous information that is stored due to the fact that it is an array, things like `.length` property – CraigR8806 Aug 09 '17 at 13:23
  • 1
    Done!! LOL ... now we hoppe you dont get 5 downvotes :) – ΦXocę 웃 Пepeúpa ツ Aug 09 '17 at 13:23
  • @CraigR8806 Actually, maybe... But how can I make sure of that? Maybe there is some profiler for that? – St.Antario Aug 09 '17 at 13:25
  • You would have to have a runtime profiler that can inspect the local variable table and even then, I'm not sure that it would show you exactly what is being allocated. – CraigR8806 Aug 09 '17 at 13:27
  • @CraigR8806 The question is can I even trust this benchmark? – St.Antario Aug 09 '17 at 13:29
  • 2
    I've seen you ask a lot of very low level, nitty-gritty detail questions about Java. Can I ask what you're doing with all this? Is it just academic curiosity, or are you planning on something special? Comparing `Byte` with `byte[1]` is a bit like comparing apples with an array of oranges after all. – Kayaman Aug 09 '17 at 13:37
  • @Kayaman It's just interesting to investigate, but I have some trouble understanding what's going on under the hood. – St.Antario Aug 09 '17 at 13:38
  • 2
    With the array version you *may* incur the overhead of bounds checking when assigning the value to the array. You could try replacing it with `byte[] b = new byte[] { 20 };` and see what that does (if anything, it might be JITted away) to the benchmark. – Kayaman Aug 09 '17 at 13:44
  • @St.Antario what do you mean by "Can I trust this benchmark"? It is not clear what the difference is between your expectations and the data you got from the benchmark, so it is not really clear what kind of answer you expect. – Oleg Estekhin Aug 14 '17 at 11:19
  • @OlegEstekhin I'm investigating the GC activity difference – St.Antario Aug 14 '17 at 11:19
  • You allocate different objects, that explains the difference, doesn't it? – Oleg Estekhin Aug 14 '17 at 11:20
  • @OlegEstekhin I thought arrays in Java are internally represented as contiguous, properly aligned memory, so it should have been more efficient to use arrays... but I was wrong. – St.Antario Aug 14 '17 at 11:21
  • actually, gc.alloc.rate.norm clearly shows that the array benchmark allocates 24 bytes per run, while the wrapper allocates 16 – Oleg Estekhin Aug 14 '17 at 11:26
  • arrays are contiguous and properly aligned, but how does that lead to the "more efficient" conclusion? – Oleg Estekhin Aug 14 '17 at 11:27
  • @OlegEstekhin it does not... I just want to understand the difference in gc activity. – St.Antario Aug 14 '17 at 23:28

1 Answer


TL;DR: it's because of the Java object memory layout* and the array overhead.

Every object has a header (where the identity hash code, lock state and GC meta-information are stored) and a class pointer. Additionally, the address of an object must be aligned to 8 bytes.

Assuming you are on an x86-64 processor, the header size is 8 bytes and the class pointer size is 4 bytes. The byte field in Byte takes 1 byte => a Byte should take 13 bytes, but rounding up to the alignment gives you exactly 16 bytes, which is what you observed with -prof gc.

For the array the calculations are mostly the same, but an array also has a 4-byte length field (technically speaking, it's not a real field, but that doesn't really matter), which gives you 8 + 4 + 4 = 16 bytes for the array itself plus 1 byte for the single byte element, which with alignment gives you 24 bytes.

So one of the reasons is that the array is simply a larger object.
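
If you want to double-check these numbers, here is a small sketch using the JOL library (assuming org.openjdk.jol:jol-core is on the classpath) that prints the exact layout of both objects:

import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    public static void main(String[] args) {
        // Byte: 8-byte header + 4-byte class pointer + 1-byte value + 3 bytes padding = 16 bytes
        System.out.println(ClassLayout.parseInstance(new Byte((byte) 20)).toPrintable());

        // byte[1]: 8-byte header + 4-byte class pointer + 4-byte length + 1-byte element + 7 bytes padding = 24 bytes
        byte[] b = new byte[1];
        b[0] = 20;
        System.out.println(ClassLayout.parseInstance(b).toPrintable());
    }
}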

The second reason is the more complex code for array creation (you can look at the generated code with -prof perfasm or -prof dtraceasm). Array initialization requires an additional store into the length field, and (I don't know why) the JIT also generates four prefetchnta instructions.

So yes, you can partially trust it: allocating a single-element array is slightly (though not 1.5x) slower.

*Note that all this is relevant only to HotSpot; other JVMs may have a different layout.

qwwdfsad
  • Actually I have a question. You said that the class pointer is also stored in the header, but... if we take a look at [markOop](http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/ff3b27e6bcc2/src/share/vm/oops/markOop.hpp#l44) we will find the header description for the 64-bit VM. No class pointer is specified there. – St.Antario Feb 02 '18 at 15:53
  • I have never used `dtraceasm`. Can you add a bit of description about it? – St.Antario Feb 02 '18 at 15:54
  • @St.Antario I said the opposite: "Every object has header **and** class pointer." `dtraceasm` is an analogue of `perfasm` that works on Mac OS X; it was added in JMH 1.20. – qwwdfsad Feb 02 '18 at 16:20
  • Thanks, understood. I just looked at the assembly generated for the `arrayBenchmark` method and also noticed `prefetchnta` (3, not four like in your case). Also, I did not notice any store instruction (for `b[0] = 20`)... Was it JITed away in the JMH loop? – St.Antario Feb 02 '18 at 16:46
  • @St.Antario JIT behaviour may vary. The store should be present as `movq $0x14, 0x10(%rax)` right after the allocation (the register may be different, but the constants should be the same). `$0x14` is the hex constant for 20. – qwwdfsad Feb 02 '18 at 17:19
  • My bad, I was looking for `12` instead of `14`... Now I see it, thanks. Anyway, this is the code related to the JMH stub method; it seems the method was inlined there... I ran it with `CompilerControl.Mode.DONT_INLINE`, and it seems the `prefetchnta` we saw before was JMH stuff...? – St.Antario Feb 02 '18 at 17:30
  • @St.Antario it's related to `arrayBenchmark`, just inlined into the JMH stub. You can experiment with `@CompilerControl` or the compilation log to validate it. – qwwdfsad Feb 02 '18 at 17:34
  • 1
    @St.Antario `prefetchnta` is related to allocation code (to `new`, to prefetch area behind allocation to simplify next allocation), not to JMH infra. It will not be gone unless you explicitly disabled it with `-XX:AllocatePrefetchStyle=0` – qwwdfsad Feb 02 '18 at 17:36
  • I have one more question... Is it possible to attach gdb to a JMH benchmark while it is running? I mean, `-prof perfasm` printed some code and I would like to debug it during benchmark execution; the issue is that it's printed only after benchmarking is finished. I can attach gdb to a regular java process and debug it, that's pretty straightforward... Can you advise something? Is it at all possible? – St.Antario Feb 02 '18 at 19:31
  • @St.Antario it's definitely possible, though I have never tried to do it with a running JMH benchmark. You need a fastdebug/slowdebug build of OpenJDK (you can find one here: https://builds.shipilev.net/). The next step is to watch one of Volker Simonis's talks (e.g. google "volker simonis gdb") and then just run `gdb java ...` :) – qwwdfsad Feb 03 '18 at 16:28
  • The problem is that when attaching to the JMH fork the benchmark is running in, the instruction addresses given by `-XX:+PrintAssembly` seem to change over time, so gdb just signals `cannot set break by the address`. With a simple java app I added a `Thread.sleep` after the method was compiled and debugged it. It seems debugging JMH benchmarks is possible, but not as easy. – St.Antario Feb 04 '18 at 08:54
  • 1
    As I said, I've never tried this, so can't really help. But if generated code is changed then you are trying to attach to benchmark when it's not in steady state. Before steady state code is usually recompiled multiple times – qwwdfsad Feb 04 '18 at 15:38