
Recently, I created an interface which forces the user to implement a single fromStream(InputStream) method, with its default methods looking like this:

public default T fromFile(File file) throws IOException {
    try (InputStream stream = new FileInputStream(file)) {
        return fromStream(stream);
    }
}

Soon after, it turned out that this was very expensive (several seconds per MB) because single bytes were being read directly from the FileInputStream.

Wrapping it in a BufferedInputStream solved my problem, but it left me with the question of why FileInputStream is so insanely expensive.
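For reference, the fix can be sketched like this (a minimal sketch, assuming a hypothetical `Deserializer<T>` interface shaped like the snippet above):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public interface Deserializer<T> {

    // The single method implementors must provide.
    T fromStream(InputStream stream) throws IOException;

    // Wrapping the FileInputStream in a BufferedInputStream means most
    // read() calls are served from an in-memory buffer instead of each
    // one turning into its own syscall.
    default T fromFile(File file) throws IOException {
        try (InputStream stream =
                new BufferedInputStream(new FileInputStream(file))) {
            return fromStream(stream);
        }
    }
}
```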

The file channel doesn't get closed or opened when reading bytes, so why is there a need for buffers in the first place?

Jan Schultke

2 Answers


If you read bytes from an unbuffered stream using the read() method, the JVM will end up making repeated read syscalls to the OS to read a single byte from the file. (Under the hood, the JVM is probably calling read(addr, offset, count) with a count of 1.)

The cost of making a syscall is large. At least a couple of orders of magnitude more than a regular method call. This is because there are significant overheads in:

  • Switching contexts between the application (unprivileged) security domain and the system (privileged) security domain. The register set needs to be saved, virtual memory mappings need to be changed, TLB entries need to be flushed, etc.
  • The OS has to do various extra things to ensure that what the syscall is requesting is legitimate. In this case, the OS has to figure out whether the requested offset and count are OK given the current file position and size, whether the address is within the application's address space and mapped as writable, and so on.

By contrast, if you use a buffered stream, the stream will try to read the file from the OS in large chunks. That typically results in a many-thousand-fold reduction in the number of syscalls.
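To illustrate the difference (a rough sketch, not a rigorous benchmark; the class and method names here are made up): the same byte-by-byte loop can be timed against a raw FileInputStream and against one wrapped in a BufferedInputStream.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadComparison {

    // Reads the stream one byte at a time; returns {bytesRead, elapsedNanos}.
    static long[] drain(InputStream in) throws IOException {
        long start = System.nanoTime();
        long count = 0;
        while (in.read() != -1) {
            count++; // on an unbuffered FileInputStream, each call is a syscall
        }
        return new long[] { count, System.nanoTime() - start };
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("read-comparison", ".bin");
        file.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(new byte[1 << 20]); // 1 MiB of zeros
        }
        try (InputStream raw = new FileInputStream(file)) {
            System.out.println("unbuffered: " + drain(raw)[1] + " ns");
        }
        try (InputStream buffered =
                new BufferedInputStream(new FileInputStream(file))) {
            System.out.println("buffered:   " + drain(buffered)[1] + " ns");
        }
    }
}
```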


In fact, this is NOT about how files are stored on disk. It is true that data ultimately has to be read a block at a time, etc. However, the OS is smart enough to do its own buffering. It can even read-ahead parts of the file so that they are in (kernel) memory ready for the application when it makes the syscall to read them.

It is extremely unlikely that multiple one-byte read() calls will result in extra disk traffic. The only scenario where this is plausible is if you wait a long time between read() calls ... and the OS reuses the space where it was caching the disk block.

Stephen C

When you read from a file, you have to read it a block at a time, because that's the only quantity the hardware supports. If you were to read a character at a time without buffering, then supposing a 512-byte block, you would read the same block 512 times to consume the whole block. If you read into a buffer instead, you would access the disk once and then read from memory.

Accessing disk is several orders of magnitude slower than accessing memory, so this isn't a great idea.
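The arithmetic above can be sketched in code (illustrative only; the 512-byte chunk size is the hypothetical block size from this answer):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedRead {

    // Counts how many read calls it takes to consume the stream when
    // reading chunkSize bytes per call. chunkSize == 1 models unbuffered
    // byte-at-a-time reads; chunkSize == 512 models block-sized reads.
    static int countReads(InputStream in, int chunkSize) throws IOException {
        byte[] buf = new byte[chunkSize];
        int calls = 0;
        while (in.read(buf) != -1) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1024]; // two 512-byte "blocks"
        System.out.println(countReads(new ByteArrayInputStream(data), 512)); // 2
        System.out.println(countReads(new ByteArrayInputStream(data), 1));   // 1024
    }
}
```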

awiebe
  • Well, a very poor OS could do it that way, but every OS that has a Java implementation has internal OS file system caches that buffer the block of data from disk, so the block won't be read again. But even to read from the file system cache requires a system call, which has a great overhead compared to anything that stays inside the user process. See Stephen's answer. – Erwin Bolwidt Feb 04 '17 at 03:45