Strange method invocation optimization issue

Question

I have been testing problem with too slow DataInputStream.readByte() method working, and found interesting, but incomprehensible issue. I'm using jdk1.7.0_40, Windows 7 64 bit.

Consider we have some huge byte-array and reading data from it. And let's compare 4 methods for reading byte-by-byte from this array:

reading via simple loop
reading via ByteArrayInputStream -> DataInputStream
reading via ByteArrayInputStream -> our own DataInputStream implementation (MyDataInputStream)
reading via ByteArrayInputStream and copy of method readByte() from DataInputStream.

I have found following results (after long-time of test-loop iterating):

Loop took aprox. 312446094 ns
DataInputStream took apox. 2555898090 ns
MyDataInputStream took aprox. 2630664298 ns
Via method readByte() copy took 309265568 ns

In other words, we have strange optimization issue: the same operations via object method invocation take in 10 times longer work, then via "native" implementation.

The question: why?.

For information:

@Test
public void testBytes1() throws IOException {
    byte[] bytes = new byte[1_000_000_000];
    Random r = new Random();
    for (int i = 0; i < bytes.length; i++)
        bytes[i] = (byte) r.nextInt();

    do {
        System.out.println();

        bytes[r.nextInt(1_000_000_000)] = (byte) r.nextInt();

        testLoop(bytes);
        testDis(bytes);
        testMyDis(bytes);
        testViaMethod(bytes);
    } while (true);
}

private void testDis(byte[] bytes) throws IOException {
    long time1 = System.nanoTime();
    long c = 0;
    try (ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
         DataInputStream dis = new DataInputStream(bais)) {
        for (int i = 0; i < bytes.length; i++) {
            c += dis.readByte();
        }
    }
    long time2 = System.nanoTime();
    System.out.println("Dis: \t\t\t\t" + (time2 - time1) + "\t\t\t\t" + c);
}

private void testMyDis(byte[] bytes) throws IOException {
    long time1 = System.nanoTime();
    long c = 0;
    try (ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
         MyDataInputStream dis = new MyDataInputStream(bais)) {
        for (int i = 0; i < bytes.length; i++) {
            c += dis.readByte();
        }
    }
    long time2 = System.nanoTime();
    System.out.println("My Dis: \t\t\t" + (time2 - time1) + "\t\t\t\t" + c);
}

private void testViaMethod(byte[] bytes) throws IOException {
    long time1 = System.nanoTime();
    long c = 0;
    try (ByteArrayInputStream bais = new ByteArrayInputStream(bytes)
    ) {
        for (int i = 0; i < bytes.length; i++) {
            c += readByte(bais);
        }
    }
    long time2 = System.nanoTime();
    System.out.println("Via method: \t\t" + (time2 - time1) + "\t\t\t\t" + c);
}

private void testLoop(byte[] bytes) {
    long time1 = System.nanoTime();
    long c = 0;
    for (int i = 0; i < bytes.length; i++) {
        c += bytes[i];
    }
    long time2 = System.nanoTime();
    System.out.println("Loop: \t\t\t\t" + (time2 - time1) + "\t\t\t\t" + c);
}

public final byte readByte(InputStream in) throws IOException {
    int ch = in.read();
    if (ch < 0)
        throw new EOFException();
    return (byte)(ch);
}

static class MyDataInputStream implements Closeable {

    InputStream in;

    MyDataInputStream(InputStream in) {
        this.in = in;
    }

    public final byte readByte() throws IOException {
        int ch = in.read();
        if (ch < 0)
            throw new EOFException();
        return (byte)(ch);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

P.S. Update for thoose, who is in doubt about my results, this is printout, using -XX:+PrintCompilation -verbose:gc -XX:CICompilerCount=1

     37    1             java.lang.String::hashCode (55 bytes)
     41    2             java.lang.String::charAt (29 bytes)
     43    3             java.lang.String::indexOf (70 bytes)
     49    4             java.lang.AbstractStringBuilder::ensureCapacityInternal (16 bytes)
     52    5             java.lang.AbstractStringBuilder::append (29 bytes)
    237    6             java.util.Random::nextInt (7 bytes)
    237    9     n       sun.misc.Unsafe::compareAndSwapLong (native)   
    238    7             java.util.concurrent.atomic.AtomicLong::get (5 bytes)
    238    8             java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
    239   10             java.util.Random::next (47 bytes)
    239   11 %           fias.TestArrays::testBytes1 @ 15 (77 bytes)
   9645   11 %           fias.TestArrays::testBytes1 @ -2 (77 bytes)   made not entrant

   9646   12 %           fias.TestArrays::testLoop @ 10 (77 bytes)
   9964   12 %           fias.TestArrays::testLoop @ -2 (77 bytes)   made not entrant
Loop:               318726397               -500090432
   9965   13             java.io.DataInputStream::readByte (23 bytes)
   9966   14  s          java.io.ByteArrayInputStream::read (36 bytes)
   9967   15 % !         fias.TestArrays::testDis @ 37 (279 bytes)
Dis:                2684374258              -500090432
  12651   16             fias.TestArrays$MyDataInputStream::readByte (23 bytes)
  12652   17 % !         fias.TestArrays::testMyDis @ 37 (279 bytes)
My Dis:             2675570541              -500090432
  15327   18             fias.TestArrays::readByte (20 bytes)
  15328   19 % !         fias.TestArrays::testViaMethod @ 23 (179 bytes)
Via method:         2367507141              -500090432

  17694   20             fias.TestArrays::testLoop (77 bytes)
  17699   21 %           fias.TestArrays::testLoop @ 10 (77 bytes)
Loop:               374525891               -500090567
  18069   22   !         fias.TestArrays::testDis (279 bytes)
Dis:                2674626125              -500090567
  20745   23   !         fias.TestArrays::testMyDis (279 bytes)
My Dis:             2671418683              -500090567
  23417   24   !         fias.TestArrays::testViaMethod (179 bytes)
Via method:         2359181776              -500090567

Loop:               315081855               -500090663
Dis:                2558738649              -500090663
My Dis:             2627056034              -500090663
Via method:         311692727               -500090663

Loop:               317813286               -500090778
Dis:                2565161726              -500090778
My Dis:             2630665760              -500090778
Via method:         314594434               -500090778

Loop:               313695660               -500090797
Dis:                2568251556              -500090797
My Dis:             2635236578              -500090797
Via method:         311882312               -500090797

Loop:               316781686               -500090929
Dis:                2563535623              -500090929
My Dis:             2638487613              -500090929
Via method:         313170789               -500090929

UPD-2: Here is benchmark and results kindly given by @maaartinus.

You say : "Loop took aprox. 312446094 ns". This real result can't be as precise as this, just because you don't really how many time your processor is running on this thread during this prediod (50%?, 100%?) and it may change a lot of thing. I recommend you to run this programm many times, stores times, and use the average value instead of one specific execution time. — Gwenc37, Jun 18 '14 at 08:26
Your benchmark does not allow for proper warmup + you have the overhead of the test framework which may bias the results. You should use [a proper benchmarking tool](http://stackoverflow.com/a/15787478/829571). — assylias, Jun 18 '14 at 08:27
Theese results are taken after several tens of testing loop iterations. If you can advice more accurate test - please, suggest it. — Andremoniy, Jun 18 '14 at 08:28
@assylias I have removed test framework, it gives the same result. — Andremoniy, Jun 18 '14 at 09:11
I could sort of confirm your results via Caliper: [benchmark](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/AndremoniyBenchmark.java) and [results](https://microbenchmarks.appspot.com/runs/6eb1a8ca-980f-42c6-96df-a69a561a290a). So I guess, now someone might try to find an explanation. — maaartinus, Jun 18 '14 at 13:55

score 3 · Accepted Answer · answered Jun 19 '14 at 08:16

Surprisingly, reason is try-with-resources statement on MyDataInputStream/DataInputStream

if we move initialization inside try block performance will be like loop/method invocation

private void testMyDis(byte[] bytes) throws IOException {
    final long time1 = System.nanoTime();
    long c = 0;
    try (ByteArrayInputStream bais = new ByteArrayInputStream(bytes)) {
        final MyDataInputStream dis = new MyDataInputStream(bais);
        for (int i = 0; i < bytes.length; i++) {
            c += dis.readByte();
        }
    }
    final long time2 = System.nanoTime();
    System.out.println("My Dis: \t\t\t" + (time2 - time1) + "\t\t\t\t" + c);
}

I think that with that unnecessary resource JIT cannot use Range Check Elimination

That is awesome! +1 for that from me, but it is still not clear about `DataInputStream`: eliminating of `try-with-resources` block in test over it doesn't leed to speed up. — Andremoniy, Jun 19 '14 at 08:35
if you check `DataInputStream`'s code, you can found, that `InputStream` saved in `volatile` field, it can be a reason of difference, but anyway, without `try-with-resources` `DataInputStream` is faster than with it. — Alexander Kudrevatykh, Jun 19 '14 at 08:53

score -1 · Answer 2 · answered Jun 18 '14 at 09:39

The answer has been in the test. the extra cost is owe to function invocation. commonly we encourage to write short and clean functions instead of long functions and consider function invocation has very low cost. but the invocation cost is still larger than direct memory access.

in this case, for testloop, we can estimate a memory read costs ~3 ns ( includes integer operations, e.g. i++, c + ) for others, there's 2 additonal layers of function invocations.so each function calls is ~15 ns. actuality we can say function call is very fast.

the only point is that there's 2 000 000 000 functions calls in each process, that's really a large number.

there's another test case to prove the function call costs: do not use any stream, just read bytes with additional function calls:

add below function,

public final long getByte( long c, byte value, int dep ) {
    if ( dep > 0 ) {
        return getByte( c, value, dep - 1);
    }
    return c + value;
}

then invoke in testLoop like:

c = getByte( c, bytes[i], 2);

then the final cost increase to the same level:

Loop: 4044010718 -499870245

Dis: 5182272442 -499870245

My Dis: 5228065271 -499870245

Via method: 655108198 -499870245

What is wrong in your answer: 1) In both cases: via `dis.readByte()` and local object `readByte()` method calls - "function" call is used, so it doesn't explain the different. 2) Even if the reason relates to "function call", it doesn't explain why JVM optimizes only local method call. — Andremoniy, Jun 19 '14 at 06:52
@Andremoniy There are more function call in case of `dis.readByte()`. But you're right that they all can be inlined and they indeed do. OTOH, a long method body could prevent inlining or reduce loop unrolling. That said, even with Alexander's answer, it's still rather unclear what's going on. — maaartinus, Jun 19 '14 at 13:19

Strange method invocation optimization issue

2 Answers2