
Is get/put from a non-direct ByteBuffer faster than get/put from a direct ByteBuffer?

If I have to read/write from a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then (for writes) update the direct ByteBuffer fully with the byte array?

Peter Lawrey
user882659

2 Answers


Is get/put from a non-direct ByteBuffer faster than get/put from a direct ByteBuffer?

If you are comparing a heap buffer with a direct buffer that does not use the native byte order (most systems are little-endian, while the default for a direct ByteBuffer is big-endian), the performance is very similar.

If you use native-ordered byte buffers, the performance can be significantly better for multi-byte values. For single bytes it makes little difference no matter what you do.
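The byte-order point is easy to verify directly: a direct buffer starts out big-endian and has to be switched to the native order explicitly. A minimal sketch (the class name is just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferOrderDemo {
    public static void main(String[] args) {
        // Direct buffer: off-heap memory; defaults to BIG_ENDIAN
        // regardless of the platform's native order.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("default direct order: " + direct.order());

        // Switch to the native order so multi-byte get/put need no byte swapping.
        direct.order(ByteOrder.nativeOrder());
        System.out.println("after: " + direct.order());
    }
}
```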

In HotSpot/OpenJDK, ByteBuffer uses the Unsafe class, and many of the native methods are treated as intrinsics. This is JVM-dependent, and AFAIK the Android VM treats it as an intrinsic in recent versions.

If you dump the assembly generated, you can see the intrinsics in Unsafe are turned into a single machine-code instruction, i.e. they don't have the overhead of a JNI call.

In fact, if you are into micro-tuning you may find that most of the time of a ByteBuffer getXxxx or putXxxx is spent in bounds checking, not the actual memory access. For this reason I still use Unsafe directly when I have to for maximum performance. (Note: this is discouraged by Oracle.)

If I have to read/write from a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then (for writes) update the direct ByteBuffer fully with the byte array?

I would hate to see what that is better than. ;) It sounds very complicated.

Often the simplest solutions are better and faster.


You can test this yourself with this code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public static void main(String... args) {
    ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
    ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
    for (int i = 0; i < 10; i++)
        runTest(bb1, bb2);
}

private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
    bb1.clear();
    bb2.clear();
    long start = System.nanoTime();
    // copy bb1 into bb2, one int at a time
    while (bb2.remaining() > 0)
        bb2.putInt(bb1.getInt());
    long time = System.nanoTime() - start;
    int operations = bb1.capacity() / 4 * 2; // one getInt + one putInt per 4 bytes
    System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
}

prints

Each putInt/getInt took an average of 83.9 ns
Each putInt/getInt took an average of 1.4 ns
Each putInt/getInt took an average of 34.7 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns

I am pretty sure a JNI call takes longer than 1.2 ns.


To demonstrate that it's not the "JNI" call but the guff around it which causes the delay, you can write the same loop using Unsafe directly.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

public static void main(String... args) {
    ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
    ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
    for (int i = 0; i < 10; i++)
        runTest(bb1, bb2);
}

private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
    Unsafe unsafe = getTheUnsafe();
    long start = System.nanoTime();
    long addr1 = ((DirectBuffer) bb1).address();
    long addr2 = ((DirectBuffer) bb2).address();
    for (int i = 0, len = Math.min(bb1.capacity(), bb2.capacity()); i < len; i += 4)
        unsafe.putInt(addr1 + i, unsafe.getInt(addr2 + i));
    long time = System.nanoTime() - start;
    int operations = bb1.capacity() / 4 * 2;
    System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
}

public static Unsafe getTheUnsafe() {
    try {
        Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        return (Unsafe) theUnsafe.get(null);
    } catch (Exception e) {
        throw new AssertionError(e);
    }
}

prints

Each putInt/getInt took an average of 40.4 ns
Each putInt/getInt took an average of 44.4 ns
Each putInt/getInt took an average of 0.4 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns

So you can see that the native call is much faster than you might expect for a JNI call. The main reason for this delay could be the L2 cache speed. ;)

All tests were run on an i3 at 3.3 GHz.

Peter Lawrey
  • Thanks Peter. This is very useful. BTW, why does Oracle recommend not to use Unsafe directly? If we use it in production code, what pitfalls might arise? – user882659 Jun 24 '12 at 10:58
  • As the name suggests, it's unsafe to use, and an error can crash the system, i.e. it's faster because all the protections are off. What I do is have two implementations: one which uses ByteBuffers as-is and another which uses Unsafe. When I am confident in the testing of the software and I need it, I can drop in the Unsafe version. – Peter Lawrey Jun 24 '12 at 11:42
  • In fact I have used Unsafe to deliberately crash the system, e.g. I want to test what happens if the application crashes **here** ;) – Peter Lawrey Jun 24 '12 at 11:53
  • @Peter, your 1st example should use absolute get/put, `bb2.putInt(i, bb1.getInt(i))` which is significantly faster. – ZhongYu Aug 14 '13 at 00:30
  • @Peter I used your 1st test, replaced the loop with absolute get/put, and saw that time is cut by 40%, presumably because there's no pointer update in absolute get/put – ZhongYu Aug 14 '13 at 15:29
  • this is the fastest "ok" implementation I was able to get: IntBuffer intBuffer1 = bb1.asIntBuffer(); IntBuffer intBuffer2 = bb2.asIntBuffer(); int count = intBuffer1.remaining(); for (int i = 0; i < count; i++) { intBuffer2.put(i, intBuffer1.get(i)); } – Matej Tymes Oct 29 '13 at 20:10
  • @MatejTymes 0.3 ns is one clock cycle. If you can get less than this, your code has probably been optimised away. ;) – Peter Lawrey Oct 29 '13 at 20:11
  • i mean fastest without using unsafe :) – Matej Tymes Oct 29 '13 at 20:13
  • @MatejTymes If you can put one ByteBuffer into another it will use Unsafe.copyMemory under the covers. This will be faster than copying one `int` at a time. It can copy blocks of memory at once, e.g. 8 bytes or more, without a bounds check on each access. – Peter Lawrey Oct 29 '13 at 20:15
  • I've updated the implementation to use just int view buffers. This way you'll avoid byte packing and unpacking, so it should be very performant even without using the Unsafe class – Matej Tymes Oct 29 '13 at 20:21
  • I was wondering whether the longer read/write times for `DirectByteBuffer` vs using the Unsafe directly, is caused by the fact that with reads, `DirectByteBuffer` brings the results to heap memory, while explicit use of `Unsafe` doesn't? – Bober02 Mar 07 '14 at 14:46
  • Also, in this case, wouldn't Heap perform better than calling `Unsafe`? – Bober02 Mar 11 '14 at 19:06
  • @Bober02 If you use Unsafe directly it will be as fast as using the heap. It is not safe, nor natural, nor productive, and not as easy to maintain but can be as fast or faster (as you can/have to control the layout of your data) Off heap memory also reduces the impact of GCs. – Peter Lawrey Mar 11 '14 at 21:39
  • Based on these two articles and the actual tests which I run: http://mentablog.soliveirajr.com/2012/11/which-one-is-faster-java-heap-or-native-memory/ and http://ashkrit.blogspot.co.uk/2013/07/which-memory-is-faster-heap-or.html (code here: https://github.com/ashkrit/blog/tree/master/allocation) I am getting confused - in the first article the results point towards the heap, and the author proves that both in accessing objects as well as contiguous arrays. The second article performing a test similar to the first test in the first article, proves the opposite... Any comments why? – Bober02 Mar 11 '14 at 21:53
  • @Bober02 There are no guarantees either way. A lot depends on what you test and even the system you use. On one machine for a test I got off-heap being twice as fast, and on another system on-heap was 5x faster. What you can say is that off-heap can reduce garbage depending on how you do it. – Peter Lawrey Mar 11 '14 at 23:22
  • The part about Android not having intrinsics is incorrect. Too lazy to search Dalvik sources, but at least on ART most get/put methods of direct Buffers delegate to intrinsics (to be precise, they use the internal `Memory` class, which is in turn [implemented via intrinsics](https://android.googlesource.com/platform/art/+/6cff09a873e0179f2a8d28727d4cd2447bd1bf16/compiler/optimizing/intrinsics.cc#180)) – user1643723 Mar 17 '17 at 04:55
  • @user1643723 thank you for the correction. It might have been added at some point in the last 5 years. I hadn't checked either. – Peter Lawrey Mar 17 '17 at 13:06
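The buffer-to-buffer bulk copy mentioned in the comments above can be sketched as below (the class name is hypothetical; whether it bottoms out in `Unsafe.copyMemory` is implementation-dependent, per the comment):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BulkBufferCopy {
    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        ByteBuffer dst = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        // One bulk put transfers all of src's remaining bytes in a single call,
        // instead of a bounds-checked putInt/getInt per element.
        dst.put(src);
        System.out.println("bytes copied: " + dst.position());
    }
}
```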

A direct buffer holds the data in JNI land, so get() and put() have to cross the JNI boundary. A non-direct buffer holds the data in JVM land.

So:

  1. If you aren't playing with the data at all in Java land, e.g. just copying a channel to another channel, direct buffers are faster, as the data never has to cross the JNI boundary at all.

  2. Conversely, if you are playing with the data in Java land, a non-direct buffer will be faster. Whether it's significant depends on how much data has to cross the JNI boundary and also on what quanta are transferred each time. For example, getting or putting a single byte at a time from/to a direct buffer could get very expensive, whereas getting/putting 16384 bytes at a time would amortize the JNI boundary cost considerably.
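The amortization idea in point 2 can be sketched with a bulk get into a byte[] (sizes and class name are illustrative):

```java
import java.nio.ByteBuffer;

public class BulkGetDemo {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16384);
        for (int i = 0; i < direct.capacity(); i++)
            direct.put((byte) i);
        direct.flip();

        // One bulk get moves the whole 16 KB chunk across in a single call,
        // amortizing any per-call overhead versus 16384 single-byte gets.
        byte[] chunk = new byte[16384];
        direct.get(chunk);
        System.out.println("copied " + chunk.length + " bytes, last = " + (chunk[16383] & 0xFF));
    }
}
```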

To answer your second paragraph, I would use a local byte[] array, not a thread-local, but then if I was playing with the data in Java land I wouldn't use a direct byte buffer at all. As the Javadoc says, direct byte buffers should only be used where they deliver a measurable performance benefit.
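The suggestion above, a plain local byte[] staged through a heap buffer rather than a thread-local array and a direct buffer, might look like this sketch (the message layout and all names are hypothetical):

```java
import java.nio.ByteBuffer;

public class LocalArrayEncode {
    // Encode a (hypothetical) 12-byte message into a local array and
    // hand back a heap buffer ready to pass to channel.write(buf).
    static ByteBuffer encode(int id, long timestamp) {
        byte[] msg = new byte[12];             // local, not thread-local
        ByteBuffer buf = ByteBuffer.wrap(msg); // heap buffer over the array
        buf.putInt(id).putLong(timestamp);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer b = encode(42, 1234L);
        System.out.println("remaining = " + b.remaining());
    }
}
```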

user207421
  • Thanks. My message size is typically 256 bytes, which I want to write to a socket. I was thinking of encoding the bytes into a thread-local byte[] array and then copying the byte array into a direct ByteBuffer and passing the direct ByteBuffer to the socket channel for writing. – user882659 Jun 24 '12 at 02:20
  • The direct ByteBuffers are pooled. Would this be preferable, or do you recommend encoding the message directly into the direct ByteBuffer rather than using the temporary byte array? – user882659 Jun 24 '12 at 02:22
  • @user882659 See edit. There is no benefit to using a direct buffer at all in this case. – user207421 Jun 24 '12 at 02:23
  • @EJP - I spent a few minutes looking at the Java 7 source code, and I couldn't see where `get` and `put` on a direct buffer does JNI calls. Could you point out where in the code it does this? – Stephen C Jun 24 '12 at 02:29
  • @StephenC See the Javadoc. It is impossible, given the description and the purpose for which they are intended, that a direct buffer *doesn't* do JNI calls. – user207421 Jun 24 '12 at 02:35
  • @EJP, one question I have is: once the code is JIT-compiled, is the JVM still crossing the JNI boundary for a direct byte buffer? I would have expected the JVM to have done something smart. – user882659 Jun 24 '12 at 03:16
  • The reference you linked to doesn't say that you do a JNI call in `get` or `put`. It only mentions JNI to say that something *eventually* calls `sun.nio.ch.FileDispatcherImpl.write0` ... presumably when the buffer is full or you explicitly flush. Also, this isn't a reference to the code. It is a reference to some guy's sketchy commentary on the code. – Stephen C Jun 24 '12 at 06:09
  • http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/DirectByteBuffer.java#DirectByteBuffer – user882659 Jun 24 '12 at 06:29
  • @user882659 That suggestion postulates that HotSpot knows how to optimize JNI code, or at least calls to it, as well as Java bytecode. I'm not aware of any evidence to that effect. – user207421 Jun 24 '12 at 10:00
  • @StephenC If you want to postulate that there is no JNI aspect to direct byte buffers, and that they perform identically to non-direct byte buffers, you need to provide (1) an explanation of how direct byte buffers differ from non-direct byte buffers that does not involve JNI, and that doesn't contradict the Javadoc, and (2) a benchmark that exhibits your performance claim. – user207421 Jun 24 '12 at 10:09
  • @EJP - I'm not postulating anything. I'm merely pointing out that the reference you claimed shows JNI is used does not do anything of the kind. In fact Peter Lawrey has provided strong evidence that JNI is **not** used; i.e. that the JIT compiler optimizes the Unsafe.getXxx calls to side-step JNI. – Stephen C Jun 24 '12 at 10:35
  • @StephenC What exactly do the statements "Java virtual machine will make a best effort to perform native I/O" and "The contents of direct buffers may reside outside of the normal garbage-collected heap" refer if there isn't a JNI aspect? – user207421 Aug 30 '12 at 05:07
  • I refer you back to my comment of "Jun 24 at 6:09". Clearly JNI is happening at some point to do the actual I/O. But there is no evidence that `get` or `put` are doing JNI calls to read/write the data in the buffer. There are other ways to do it. Neither of those two statements contradicts this. – Stephen C Aug 30 '12 at 05:52
  • get/put passes no borders. It's compiled native code for something like memcpy(arr, &value, sizeof(value)); After JITting, it's assembly speed. No need for JNI in 2015. – Kr0e Mar 13 '15 at 08:47
  • @Kr0e Nonsense. If what you claim was true, direct byte buffers wouldn't need to exist. – user207421 Mar 17 '17 at 09:21