
I have a 5 GB file that I want to read in chunks of, say, 2 MB. Using `java.io.InputStream` works fine, so I measured it as follows:

static final byte[] buffer = new byte[2 * 1024 * 1024];

public static void main(String args[]) throws IOException {
    while(true){
        InputStream is = new FileInputStream("/tmp/log_test.log");
        long bytesRead = 0;
        int readCurrent;
        long start = System.nanoTime();
        while((readCurrent = is.read(buffer)) > 0){
            bytesRead += readCurrent;
        }
        long end = System.nanoTime();
        System.out.println(
            "Bytes read = " + bytesRead + ". Time elapsed = " + (end - start)
        );
    }
}

RESULT = 2121714428

As can be seen, it takes 2121714428 ns on average. This is because the implementation does `(*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);` to copy the data read into a malloced or stack-allocated buffer, as shown here. So the `memcpy` takes a pretty large amount of CPU time:

(profiler screenshot: a large share of CPU time spent in memcpy)

Since the JNI spec states that

Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)

I don't see any problem with reading from a regular file inside a critical section. A read from a regular file blocks only briefly and does not depend on any Java thread. Something like this:

static final byte[] buffer = new byte[2 * 1024 * 1024];

public static void main(String args[]) throws IOException {
    while (true) {
        int fd = open("/tmp/log_test.log");
        long bytesRead = 0;
        int readCurrent;
        long start = System.nanoTime();
        while ((readCurrent = read(fd, buffer)) > 0) {
            bytesRead += readCurrent;
        }
        long end = System.nanoTime();
        System.out.println("Bytes read = " + bytesRead + ". Time elapsed = " + (end - start));
    }
}

private static native int open(String path);

private static native int read(int fd, byte[] buf);

JNI functions:

JNIEXPORT jint JNICALL Java_com_test_Main_open
  (JNIEnv *env, jclass jc, jstring path){
    const char *native_path = (*env)->GetStringUTFChars(env, path, NULL);
    int fd = open(native_path, O_RDONLY);
    (*env)->ReleaseStringUTFChars(env, path, native_path);
    return fd;
}


JNIEXPORT jint JNICALL Java_com_test_Main_read
  (JNIEnv *env, jclass jc, jint fd, jbyteArray arr){
    size_t java_array_size = (size_t) (*env)->GetArrayLength(env, arr);
    void *buf = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
    ssize_t bytes_read = read(fd, buf, java_array_size);
    (*env)->ReleasePrimitiveArrayCritical(env, arr, buf, 0);
    return (jint) bytes_read;
}

RESULT = 1179852225

Running this in a loop takes 1179852225 ns on average, which is almost twice as fast.

Question: What's the actual problem with reading from a regular file within critical section?

Some Name
  • `InputStream` is not buffered, there is context switching and reading the file system is a (potentially) blocking operation. Instead of JNI, you should probably use [**nio**](https://en.wikipedia.org/wiki/Non-blocking_I/O_(Java)) to read your files. – Elliott Frisch May 31 '19 at 23:54
  • @ElliottFrisch Sure, `DirectByteBuffer` does not suffer from the additional copy since it resides in the C heap and its content is not reachable by the GC, but the question is why we can't use critical methods when using `byte[]` and `java.io`? – Some Name May 31 '19 at 23:55
  • @ElliottFrisch I'm confused by the wording of the JNI spec. What syscalls are allowed to be called within critical sections...? – Some Name May 31 '19 at 23:57
  • 1
    Any that will not *cause the current thread to block and wait for another Java thread*; mainly because you can crash the JVM that way (the other thread doesn't know to notify the blocking native thread). – Elliott Frisch Jun 01 '19 at 00:14
  • @ElliottFrisch Considering the worst-case scenario: if the `read` is done inside a critical region by some thread and another thread hits an `OutOfMemoryError`, does it mean that the blocked thread will not be notified and the JVM is not terminated (as it should have been)? – Some Name Jun 01 '19 at 00:22
  • In that case, I suspect (but have not tested) that the native thread will never wake up (but the OS will clean it up when the JVM terminates). – Elliott Frisch Jun 01 '19 at 00:23
  • @ElliottFrisch Got it, thanks. But maybe you know a workaround? The reason I don't use `nio` is that I am obliged to work with `byte[]`. Reading into a `HeapByteBuffer` involves an additional copy from a temporary `DirectByteBuffer`, so it does not really change things. – Some Name Jun 01 '19 at 00:27
  • Using a `BufferedInputStream` won't help with large reads. Basically, if you can't use NIO you are out of *safe* options for making the reads go faster. – Stephen C Jun 01 '19 at 01:33
  • @StephenC I found an article about criticals https://shipilev.net/jvm/anatomy-quarks/9-jni-critical-gclocker/ and observed that calling JNI functions inside a critical region is really hazardous. But probably a "critical read" like this is not that dangerous... – Some Name Jun 01 '19 at 01:57
  • 2
    I wouldn't use something that is "probably not dangerous" in production code. You need to be *sure*. But ... hey ... we can only offer advice. The responsibility is yours. – Stephen C Jun 01 '19 at 02:23
  • 2
    But here is a "for instance". Consider what happens if the file you are reading is on a network share or an NFS server. In that scenario, the OS may need to fetch the data over the network. It may take a long time. Indeed, it may even take an indefinitely long time ... if the server goes down, etc. All the while, your code is in the "critical region", and potentially blocking the GC, etcetera. – Stephen C Jun 01 '19 at 02:31
  • 1
    Even “reading directly into the array” likely is a “copy from the underlying fs buffer” operation, so when you are confident that the source is a local file, you might prefer using memory mapping, combined with copying to a `HeapByteBuffer`. That still bears the unavoidable copy operation, but avoids the additional copy operation introduced by an intermediate `DirectByteBuffer`. – Holger Jun 03 '19 at 11:33
  • @Holger _likely is a “copy from the underlying fs buffer”_ I would say definitely.... – Some Name Jun 03 '19 at 12:59
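Holger's memory-mapping suggestion from the comments could be sketched roughly like this (a minimal sketch, not a tested implementation: the per-chunk `FileChannel.map` call has its own mmap/munmap cost, so real code would likely map larger windows; the path in `main` is the one from the question):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    static final byte[] buffer = new byte[2 * 1024 * 1024];

    // Reads the whole file through a memory mapping, copying each mapped
    // window straight into the heap byte[] -- one copy total, with no
    // intermediate DirectByteBuffer.
    static long readAll(Path path) throws IOException {
        long bytesRead = 0;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; ) {
                int chunk = (int) Math.min(buffer.length, size - pos);
                // The mapping is backed by the page cache, so get() below is
                // the single unavoidable page-cache -> heap copy.
                MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
                mbb.get(buffer, 0, chunk);
                bytesRead += chunk;
                pos += chunk;
            }
        }
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long n = readAll(Paths.get("/tmp/log_test.log"));
        System.out.println("Bytes read = " + n + ". Time elapsed = " + (System.nanoTime() - start));
    }
}
```

Unlike the critical-section variant, this never blocks the GC, since no JNI critical region is held while touching the file.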

1 Answer


A 2 MB buffer with `FileInputStream` is probably not the best choice. See this question for details. Although it was on Windows, I've seen a similar performance issue on Linux. Depending on the OS, allocating a large temporary buffer may result in extra mmap calls and subsequent page faults. Such a large buffer also makes the L1/L2 caches useless.
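The buffer-size effect is easy to probe with a variation of the question's benchmark (a rough sketch; the smaller sizes are arbitrary examples for comparison, and only the 2 MB figure and the file path come from the question):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferSizes {
    // Drains the stream with the given chunk size; returns total bytes read.
    static long drain(InputStream in, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Compare a few chunk sizes; smaller buffers often do comparably or
        // better because the temporary native buffer stays cache-friendly.
        for (int size : new int[]{8 * 1024, 64 * 1024, 2 * 1024 * 1024}) {
            long start = System.nanoTime();
            long total;
            try (InputStream is = new FileInputStream("/tmp/log_test.log")) {
                total = drain(is, size);
            }
            System.out.println(size + "-byte buffer: read " + total
                    + " bytes in " + (System.nanoTime() - start) + " ns");
        }
    }
}
```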

Reading from a regular file is blocked only briefly and does not depend on any java thread.

This is not always true. In your benchmark the file is apparently cached in the OS page cache, so no device I/O happens. Accessing real hardware (especially a spinning disk) can be orders of magnitude slower. The worst-case disk I/O time is not fully predictable; it can be as large as hundreds of milliseconds, depending on the hardware condition, the length of the I/O queue, the scheduling policy and so on.

The problem with a JNI critical section is that whenever a delay happens, it may affect all threads, not only the one doing I/O. This is not an issue for a single-threaded application, but it may cause undesirable stop-the-world pauses in a multi-threaded app.

The other argument against JNI critical is JVM bugs related to GCLocker. Sometimes they may cause redundant GC cycles or cause certain GC flags to be ignored. Here are some examples (still not fixed):

  • JDK-8048556 Unnecessary GCLocker-initiated young GCs
  • JDK-8057573 CMSScavengeBeforeRemark ignored if GCLocker is active
  • JDK-8057586 Explicit GC ignored if GCLocker is active

So, the question is whether you care about throughput or latency. If you only need higher throughput, JNI critical is probably the right way to go. However, if you also care about predictable latency (not the average latency, but, say, the 99.9th percentile), then JNI critical does not seem like a good choice.

apangin
  • _In your benchmark the file is apparently cached in OS page cache and no device I/O happens_. That was the reason. `perf stat -e 'block:*'` recorded no interactions with the physical device. After clearing the caches, the difference in performance was less than 10%. But I suppose no disk I/O is likely to happen even if some process has just written data without any kind of `sync` and another process then tries to read it. (I tested this by manually adding some data to the file and then running a reading process under `perf stat -e 'block:*'`; no interactions were recorded.) – Some Name Jun 01 '19 at 22:52
  • It's worth pointing out that OP reads the file from "/tmp" in the question; that ***might*** be `tmpfs` (and RAM backed). – Elliott Frisch Jun 02 '19 at 16:01
  • @ElliottFrisch It is. No major faults occur when I was benchmarking it. But I don't think that's a problem. – Some Name Jun 03 '19 at 14:04