The performance difference between java.lang.System and Unsafe

Question

System and Unsafe offer some overlapping functionality (for example, System.arraycopy vs. _UNSAFE.copyMemory).

In terms of implementation, it looks like both rely on JNI. Is that correct? (I could find unsafe.cpp, but could not find the corresponding arraycopy implementation in the JVM source code.)

Also, if both rely on JNI, can I assume that the invocation overhead of the two is similar?

I know Unsafe can manipulate off-heap memory, but let's restrict the comparison to on-heap memory here.

Thanks for the answer.

Old, but probably still useful point to find the arraycopy implementation: https://stackoverflow.com/questions/11210369/openjdk-implementation-of-system-arraycopy -> search for objArrayKlass.cpp — Juraj Martinka, Aug 25 '22 at 04:01
It’s rather unlikely to be JNI, but rather some JVM internal call mechanism. And your assumed “invocation overhead” is not determining the actual copying performance. It’s not even guaranteed that an actual call happens, as the JIT compiler may replace a call to such a well known method by code tailored to the specific caller. — Holger, Aug 25 '22 at 07:34

score 2 · Accepted Answer · answered Aug 26 '22 at 17:14

Both System.arraycopy and Unsafe.copyMemory are HotSpot intrinsics. This means the JVM does not use the JNI implementation when these methods are called from a JIT-compiled method. Instead, it replaces the call with architecture-specific, optimized assembly code.

You can find the sources in stubGenerator_<arch>.cpp.
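
One way to observe this yourself is sketched below. The class name IntrinsicDemo and its methods are made up for illustration. The idea is only to call System.arraycopy from a method that becomes hot enough to be JIT-compiled, and then to look at the compiler's inlining log.

```java
import java.util.Arrays;

public class IntrinsicDemo {

    // Hot method: after enough invocations HotSpot is expected to JIT-compile it.
    public static void copy(byte[] src, byte[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // Calls copy() repeatedly so that it gets hot, then returns the destination array.
    public static byte[] warmUp(int iterations) {
        byte[] src = new byte[1024];
        byte[] dst = new byte[1024];
        Arrays.fill(src, (byte) 42);
        for (int i = 0; i < iterations; i++) {
            copy(src, dst);
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] dst = warmUp(1_000_000);
        System.out.println(dst[0] + " " + dst[1023]);
    }
}
```

Running it as `java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining IntrinsicDemo` should, on a typical HotSpot build, list the System::arraycopy call site marked as an intrinsic. The exact log format depends on the JDK version, so treat this as a diagnostic aid rather than a stable interface.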

Here is a simple JMH benchmark:

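import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import static one.nio.util.JavaInternals.byteArrayOffset;
import static one.nio.util.JavaInternals.unsafe;

// Report average time in ns/op, matching the units of the results below.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CopyMemory {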

    @Param({"12", "123", "1234", "12345", "123456"})
    int size;

    byte[] src;
    byte[] dst;

    @Setup
    public void setup() {
        src = new byte[size];
        dst = new byte[size];
        ThreadLocalRandom.current().nextBytes(src);
    }

    @Benchmark
    public void systemArrayCopy() {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    @Benchmark
    public void unsafeCopyMemory() {
        unsafe.copyMemory(src, byteArrayOffset, dst, byteArrayOffset, src.length);
    }
}

It confirms that the performance of both methods is similar:

Benchmark                    (size)  Mode  Cnt     Score    Error  Units
CopyMemory.systemArrayCopy       12  avgt   16     5.294 ±  0.162  ns/op
CopyMemory.systemArrayCopy      123  avgt   16     7.057 ±  0.406  ns/op
CopyMemory.systemArrayCopy     1234  avgt   16    18.761 ±  0.492  ns/op
CopyMemory.systemArrayCopy    12345  avgt   16   353.386 ±  3.627  ns/op
CopyMemory.systemArrayCopy   123456  avgt   16  5234.125 ± 57.914  ns/op
CopyMemory.unsafeCopyMemory      12  avgt   16     5.028 ±  0.120  ns/op
CopyMemory.unsafeCopyMemory     123  avgt   16     8.055 ±  0.405  ns/op
CopyMemory.unsafeCopyMemory    1234  avgt   16    19.776 ±  0.523  ns/op
CopyMemory.unsafeCopyMemory   12345  avgt   16   353.549 ±  5.878  ns/op
CopyMemory.unsafeCopyMemory  123456  avgt   16  5246.298 ± 65.427  ns/op

If you run this JMH benchmark with the -prof perfasm profiler, you'll see that both methods boil down to exactly the same assembly loop:

# systemArrayCopy

  0.64%   ↗   0x00007fa95d4336d0:   vmovdqu -0x38(%rdi,%rdx,8),%ymm0
  2.81%   │   0x00007fa95d4336d6:   vmovdqu %ymm0,-0x38(%rsi,%rdx,8)
  5.67%   │   0x00007fa95d4336dc:   vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 69.64%   │   0x00007fa95d4336e2:   vmovdqu %ymm1,-0x18(%rsi,%rdx,8)
 15.28%   │   0x00007fa95d4336e8:   add    $0x8,%rdx
          ╰   0x00007fa95d4336ec:   jle    Stub::jbyte_disjoint_arraycopy+112 0x00007fa95d4336d0

# unsafeCopyMemory
  
  1.08%   ↗   0x00007f2d39833af0:   vmovdqu -0x38(%rdi,%rdx,8),%ymm0
  3.09%   │   0x00007f2d39833af6:   vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
  5.78%   │   0x00007f2d39833afc:   vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 66.44%   │   0x00007f2d39833b02:   vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 19.00%   │   0x00007f2d39833b08:   add    $0x8,%rdx
          ╰   0x00007f2d39833b0c:   jle    Stub::jlong_disjoint_arraycopy+48 0x00007f2d39833af0

When working with regular arrays in the Java heap, there is absolutely no need to use the Unsafe API. The standard System.arraycopy is very well optimized. The JDK class library itself uses System.arraycopy pretty much everywhere, including StringBuilder, ArrayList, ByteArrayOutputStream, etc.
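
As a rough illustration of the kind of on-heap copying these classes do, here is a minimal sketch. HeapCopyExample, grow and insert are hypothetical names, not JDK code. It shows the two common patterns: growing a backing array, and shifting elements within the same array.

```java
import java.util.Arrays;

public class HeapCopyExample {

    // Returns a new array of newCapacity bytes whose first `length` bytes
    // are copied from src; the remainder stays zero.
    public static byte[] grow(byte[] src, int length, int newCapacity) {
        byte[] dst = new byte[newCapacity];
        System.arraycopy(src, 0, dst, 0, length);
        return dst;
    }

    // Inserts value at index, shifting the tail right by one position.
    // System.arraycopy is specified to handle overlapping ranges
    // within the same array correctly.
    public static void insert(byte[] buf, int size, int index, byte value) {
        System.arraycopy(buf, index, buf, index + 1, size - index);
        buf[index] = value;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        byte[] bigger = grow(data, data.length, 8);
        insert(bigger, 4, 1, (byte) 9);
        System.out.println(Arrays.toString(bigger)); // [1, 9, 2, 3, 4, 0, 0, 0]
    }
}
```

Neither pattern needs Unsafe: bounds and type checks are handled by System.arraycopy, and the copy itself ends up in the same optimized stubs shown above.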

Thanks for the answer. May I ask: if `Unsafe` methods are HotSpot intrinsics, why is `copyMemory` still implemented in `unsafe.cpp`, where it looks like a JNI function (`JNIEnv` is the first parameter of `copyMemory`)? Thank you. — Bostonian, Aug 26 '22 at 17:26
@Bostonian Intrinsic functions are called from JIT-compiled code. When a method runs in the interpreter, it still uses the JNI implementation. — apangin, Aug 26 '22 at 17:35
Thank you for the answer. Does that imply that if I implement a JNI method and call it from a Java method, the JNI call would prevent that Java method from being JIT-compiled? Thank you. — Bostonian, Aug 26 '22 at 17:53
@Bostonian No, JNI method calls do not prevent JIT compilation. — apangin, Aug 26 '22 at 18:51
