8

If I read bytes from a file into a byte[] I see that FileInputStream performance worse when the array is around 1 MB compared to 128 KB. On the 2 workstations I have tested it is almost twice as fast with 128 KB. Why is that?

import java.io.*;

public class ReadFileInChuncks 
{
    public static void main(String[] args) throws IOException 
    {
        byte[] buffer1 = new byte[1024*128];
        byte[] buffer2 = new byte[1024*1024];

        String path = "some 1 gb big file";

        readFileInChuncks(path, buffer1, false);

        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
    }

    public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException
    {
        long t = System.currentTimeMillis();

        InputStream is = new FileInputStream(path);
        while ((readToArray(is, buffer)) != 0) {}

        if (report)
            System.out.println((System.currentTimeMillis()-t) + " ms");
    }

    public static int readToArray(InputStream is, byte[] buffer) throws IOException
    {
        int index = 0;
        while (index != buffer.length)
        {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1)
                break;
            index += read;
        }
        return index;
    }
}

outputs

422 ms 
717 ms 
422 ms 
718 ms

Notice this is a redefinition of an already posted question. The other was polluted with unrelated discussions. I will mark the other for deletion.

Edit: Duplicate, really? I sure could make some better code to proof my point, but this does not answer my question

Edit2: I ran the test with every buffer between 5 KB and 1000 KB on
Win7 / JRE 1.8.0_25 and the bad performance starts at precis 508 KB and all subsequent. Sorry for the bad diagram legions, x is buffer size, y is milliseconds

enter image description here

Community
  • 1
  • 1
Stig
  • 1,974
  • 2
  • 23
  • 50
  • 1
    i ran the same code twice to a big >4gb hprof file. and found results which are opposite to what you have mentioned. 8358 ms 6302 ms 7986 ms 6256 ms and 8591 ms 6326 ms 8022 ms 6268 ms – anurag gupta Feb 24 '15 at 19:22
  • On what OS/JRE combo is that? – fge Feb 24 '15 at 19:23
  • Win7 / JRE 1.8.0_25. And I see the same distinct difference on both SSD and HD. – Stig Feb 24 '15 at 19:29
  • How many times did you execute this? Is this your actual code or did you use a microbenchmark? http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – Brett Okken Feb 24 '15 at 19:36
  • @anurag gupta, seems to be related to the native file implementation. – Stig Feb 24 '15 at 19:36
  • @Brett Okken, no i did not put much effort into it. Nevertheless I think the difference is very distinct regardless of the order. – Stig Feb 24 '15 at 19:38
  • @Stig check your CPU L2 cache size - it may have the size of 512KB. On my system the drop occurs on 1000KB and my i5-3570 has a L2 cache of 1024KB... – Robert Mar 25 '15 at 14:48
  • @Robert, it seems to be the difference with HeapCreate on 32bit vs 64bit. – Stig Mar 26 '15 at 08:26

4 Answers4

8

TL;DR The performance drop is caused by memory allocation, not by file reading issues.

A typical benchmarking problem: you benchmark one thing, but actually measure another.

First of all, when I rewrote the sample code using RandomAccessFile, FileChannel and ByteBuffer.allocateDirect, the threshold has disappeared. File reading performance became roughly the same for 128K and 1M buffer.

Unlike direct ByteBuffer I/O FileInputStream.read cannot load data directly into Java byte array. It needs to get data into some native buffer first, and then copy it to Java using JNI SetByteArrayRegion function.

So we have to look at the native implementation of FileInputStream.read. It comes down to the following piece of code in io_util.c:

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

Here BUF_SIZE == 8192. If the buffer is larger than this reserved stack area, a temporary buffer is allocated by malloc. On Windows malloc is usually implemented via HeapAlloc WINAPI call.

Next, I measured the performance of HeapAlloc + HeapFree calls alone without file I/O. The results were interesting:

     128K:    5 μs
     256K:   10 μs
     384K:   15 μs
     512K:   20 μs
     640K:   25 μs
     768K:   29 μs
     896K:   33 μs
    1024K:  316 μs  <-- almost 10x leap
    1152K:  356 μs
    1280K:  399 μs
    1408K:  436 μs
    1536K:  474 μs
    1664K:  511 μs
    1792K:  553 μs
    1920K:  592 μs
    2048K:  628 μs

As you can see, the performance of OS memory allocation drastically changes at 1MB boundary. This can be explained by different allocation algorithms used for small chunks and for large chunks.

UPDATE

The documentation for HeapCreate confirms the idea about specific allocation strategy for blocks larger than 1MB (see dwMaximumSize description).

Also, the largest memory block that can be allocated from the heap is slightly less than 512 KB for a 32-bit process and slightly less than 1,024 KB for a 64-bit process.

...

Requests to allocate memory blocks larger than the limit for a fixed-size heap do not automatically fail; instead, the system calls the VirtualAlloc function to obtain the memory that is needed for large blocks.

apangin
  • 92,924
  • 10
  • 193
  • 247
  • in the last comment to the question @Robert mentioned L2 cache of 1024kb, do you think it may be the reason behind the malloc jump? – Zielu Mar 25 '15 at 23:24
  • @Zielu No, I'm absolutely sure it can't be the reason for 10x difference. Actually, in my HeapAlloc test the allocated memory isn't even used, so the cache does not matter. In fact, I've already found the reason, see the update. – apangin Mar 26 '15 at 00:24
  • BUFFER_SIZE is 8192. The problem does not surface till the buffer is multiple times bigger, which means that malloc is being used without a problem. Am I missing something? – Jose Martinez Mar 26 '15 at 01:04
  • @JoseMartinez malloc(500K) is a way faster than malloc(1024K), because the former allocates from the process heap while the latter results in VirtualAlloc call that asks kernel to map more pages into the process address space. – apangin Mar 26 '15 at 06:53
  • Great work, this must be the reason. I guess your machine is 64 bit and the graph I did was on a 32 bit machine. – Stig Mar 26 '15 at 07:41
  • Thank you for dedicating your time to it. Well done. – Zielu Mar 26 '15 at 15:21
1

Optimal buffer size depands on file system block size, CPU cache size and cache latency. Most os'es use block size 4096 or 8192 so it is recommended to use buffer with this size or multiplicity of this value.

Piotr Zych
  • 483
  • 4
  • 19
  • As well, there is typically very little, if any, gain beyond a 4K. Making huge buffers is tremendously wasteful and comes back to bite when you suddenly run the code at scale (and now there's 10,000 of these instead of a handful). Default JVM buffers are 128 bytes, though I find that a little too small. – Lawrence Dol Feb 24 '15 at 19:42
  • @Piotr Zych, multiplicity of 4096 or 8192 seems not to be the cure. See the graph in the question. – Stig Feb 25 '15 at 07:06
  • I assume in this case the CPU cache size may be the main problem. The main memory is much slower than the CPU L2 cache. – Robert Mar 25 '15 at 14:42
0

I rewrote the test to test different sizes of buffer.

Here is the new code:

public class ReadFileInChunks {

    public static void main(String[] args) throws IOException {
        String path = "C:\\\\tmp\\1GB.zip";
        readFileInChuncks(path, new byte[1024 * 128], false);

        for (int i = 1; i <= 1024; i+=10) {
            readFileInChuncks(path, new byte[1024 * i], true);
        }
    }

    public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException {
        long t = System.currentTimeMillis();

        InputStream is = new FileInputStream(path);
        while ((readToArray(is, buffer)) != 0) {
        }

        if (report) {
            System.out.println("buffer size = " + buffer.length/1024 + "kB , duration = " + (System.currentTimeMillis() - t) + " ms");
        }
    }

    public static int readToArray(InputStream is, byte[] buffer) throws IOException {
        int index = 0;
        while (index != buffer.length) {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1) {
                break;
            }
            index += read;
        }
        return index;
    }

}

And here are the results...

buffer size = 121kB , duration = 320 ms
buffer size = 131kB , duration = 330 ms
buffer size = 141kB , duration = 330 ms
buffer size = 151kB , duration = 323 ms
buffer size = 161kB , duration = 320 ms
buffer size = 171kB , duration = 320 ms
buffer size = 181kB , duration = 320 ms
buffer size = 191kB , duration = 310 ms
buffer size = 201kB , duration = 320 ms
buffer size = 211kB , duration = 310 ms
buffer size = 221kB , duration = 310 ms
buffer size = 231kB , duration = 310 ms
buffer size = 241kB , duration = 310 ms
buffer size = 251kB , duration = 310 ms
buffer size = 261kB , duration = 320 ms
buffer size = 271kB , duration = 310 ms
buffer size = 281kB , duration = 320 ms
buffer size = 291kB , duration = 310 ms
buffer size = 301kB , duration = 319 ms
buffer size = 311kB , duration = 320 ms
buffer size = 321kB , duration = 310 ms
buffer size = 331kB , duration = 320 ms
buffer size = 341kB , duration = 310 ms
buffer size = 351kB , duration = 320 ms
buffer size = 361kB , duration = 310 ms
buffer size = 371kB , duration = 320 ms
buffer size = 381kB , duration = 311 ms
buffer size = 391kB , duration = 310 ms
buffer size = 401kB , duration = 310 ms
buffer size = 411kB , duration = 320 ms
buffer size = 421kB , duration = 310 ms
buffer size = 431kB , duration = 310 ms
buffer size = 441kB , duration = 310 ms
buffer size = 451kB , duration = 320 ms
buffer size = 461kB , duration = 310 ms
buffer size = 471kB , duration = 310 ms
buffer size = 481kB , duration = 310 ms
buffer size = 491kB , duration = 310 ms
buffer size = 501kB , duration = 310 ms
buffer size = 511kB , duration = 320 ms
buffer size = 521kB , duration = 300 ms
buffer size = 531kB , duration = 310 ms
buffer size = 541kB , duration = 312 ms
buffer size = 551kB , duration = 311 ms
buffer size = 561kB , duration = 320 ms
buffer size = 571kB , duration = 310 ms
buffer size = 581kB , duration = 314 ms
buffer size = 591kB , duration = 320 ms
buffer size = 601kB , duration = 310 ms
buffer size = 611kB , duration = 310 ms
buffer size = 621kB , duration = 310 ms
buffer size = 631kB , duration = 310 ms
buffer size = 641kB , duration = 310 ms
buffer size = 651kB , duration = 310 ms
buffer size = 661kB , duration = 301 ms
buffer size = 671kB , duration = 310 ms
buffer size = 681kB , duration = 310 ms
buffer size = 691kB , duration = 310 ms
buffer size = 701kB , duration = 310 ms
buffer size = 711kB , duration = 300 ms
buffer size = 721kB , duration = 310 ms
buffer size = 731kB , duration = 310 ms
buffer size = 741kB , duration = 310 ms
buffer size = 751kB , duration = 310 ms
buffer size = 761kB , duration = 311 ms
buffer size = 771kB , duration = 310 ms
buffer size = 781kB , duration = 300 ms
buffer size = 791kB , duration = 300 ms
buffer size = 801kB , duration = 310 ms
buffer size = 811kB , duration = 310 ms
buffer size = 821kB , duration = 300 ms
buffer size = 831kB , duration = 310 ms
buffer size = 841kB , duration = 310 ms
buffer size = 851kB , duration = 300 ms
buffer size = 861kB , duration = 310 ms
buffer size = 871kB , duration = 310 ms
buffer size = 881kB , duration = 310 ms
buffer size = 891kB , duration = 304 ms
buffer size = 901kB , duration = 310 ms
buffer size = 911kB , duration = 310 ms
buffer size = 921kB , duration = 310 ms
buffer size = 931kB , duration = 299 ms
buffer size = 941kB , duration = 321 ms
buffer size = 951kB , duration = 310 ms
buffer size = 961kB , duration = 310 ms
buffer size = 971kB , duration = 310 ms
buffer size = 981kB , duration = 310 ms
buffer size = 991kB , duration = 295 ms
buffer size = 1001kB , duration = 339 ms
buffer size = 1011kB , duration = 302 ms
buffer size = 1021kB , duration = 610 ms

It looks like some sort of threshold is hit at around 1021kB size buffer. Looking deeper into this I see...

buffer size = 1017kB , duration = 310 ms
buffer size = 1018kB , duration = 310 ms
buffer size = 1019kB , duration = 602 ms
buffer size = 1020kB , duration = 600 ms

So it looks like there is some sort of doubling effect is going on when this threshold is hit. My initial thoughts are that the readToArray while loop was looping double the amount of times when the threshold was hit, but that's not the case, the while loop only goes through one iteration whether 300ms run or 600ms run. So lets look at the actual io_utils.c that implements actually reads the data from the disk for some clues.

jint
readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
          jint off, jint len, jfieldID fid)
{
    jint nread;
    char stackBuf[BUF_SIZE];
    char *buf = NULL;
    FD fd;

    if (IS_NULL(bytes)) {
        JNU_ThrowNullPointerException(env, NULL);
        return -1;
    }

    if (outOfBounds(env, off, len, bytes)) {
        JNU_ThrowByName(env, "java/lang/IndexOutOfBoundsException", NULL);
        return -1;
    }

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

    fd = GET_FD(this, fid);
    if (fd == -1) {
        JNU_ThrowIOException(env, "Stream Closed");
        nread = -1;
    } else {
        nread = (jint)IO_Read(fd, buf, len);
        if (nread > 0) {
            (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
        } else if (nread == JVM_IO_ERR) {
            JNU_ThrowIOExceptionWithLastError(env, "Read error");
        } else if (nread == JVM_IO_INTR) {
            JNU_ThrowByName(env, "java/io/InterruptedIOException", NULL);
        } else { /* EOF */
            nread = -1;
        }
    }

    if (buf != stackBuf) {
        free(buf);
    }
    return nread;
}

One thing to note is that BUF_SIZE is set to 8192. The doubling effect happens way above that. So the next culprit would be the IO_Read method.

windows/native/java/io/io_util_md.h:#define IO_Read handleRead

So we go to handleRead method.

windows/native/java/io/io_util_md.c:handleRead(jlong fd, void *buf, jint len)

This method hands of the request to a method called ReadFile.

JNIEXPORT
size_t
handleRead(jlong fd, void *buf, jint len)
{
    DWORD read = 0;
    BOOL result = 0;
    HANDLE h = (HANDLE)fd;
    if (h == INVALID_HANDLE_VALUE) {
        return -1;
    }
    result = ReadFile(h,          /* File handle to read */
                      buf,        /* address to put data */
                      len,        /* number of bytes to read */
                      &read,      /* number of bytes read */
                      NULL);      /* no overlapped struct */
    if (result == 0) {
        int error = GetLastError();
        if (error == ERROR_BROKEN_PIPE) {
            return 0; /* EOF */
        }
        return -1;
    }
    return read;
}

And this is where the trail runs cold.... for now. If I find the code for ReadFile I will take a look and post back.

Jose Martinez
  • 11,452
  • 7
  • 53
  • 68
  • All very interesting but this is not an answer. It should have been edited into your queston. – user207421 Mar 25 '15 at 23:11
  • 2
    @EJP as the author said it is work in progress and it seems to be a good one as another new answer builds on top of it. This question could not be answered in 5 minutes, so I believe it should be allowed to share one's investigation as long as it makes sense and this one does. – Zielu Mar 25 '15 at 23:21
  • 1
    @Jose thank you for the effort. You were the first one who addressed the problem from the right angle. If I could split the bounty I would do it, but there is no such option and Apangin managed to get to the bottom of it. – Zielu Mar 26 '15 at 15:17
  • The reward I got was the knowledge I gained from trying to solve it. Thanks for posting such a great question. – Jose Martinez Mar 26 '15 at 15:50
-1

This can be because of cpu cache,

cpu has its own cache memory and there is some fix size for that you can check your cpu cache size by executing this command on cmd

wmic cpu get L2CacheSize

suppose you have 256k as cpu cache size, So what happens is If you read 256k chunks or smaller, the content that was written to the buffer is still in the CPU cache when the read accesses it. If you have chunks greater of 256k then the last 256k that were read are in the CPU cache, so when the read starts from the beginning the content must be retrieved from main memory.

Keval
  • 1,857
  • 16
  • 26
  • Read the question carefully: "Win7 / JRE 1.8.0_25 and the bad performance starts at precis 508 KB and all subsequent" – Zielu Mar 25 '15 at 15:16