
When I seek to some position in a file and write a small amount of data (20 bytes), what goes on behind the scenes?

My understanding

To my knowledge, the smallest unit of data that can be written to or read from a disk is one sector (traditionally 512 bytes, though that standard is now changing). That means that to write 20 bytes I need to read a whole sector, modify some of it in memory, and write it back to disk.
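What that read-modify-write cycle amounts to can be sketched in user-space C. This is purely illustrative: in reality the kernel's page cache does the equivalent on your behalf, usually at filesystem-block rather than sector granularity, and the 512-byte size plus the assumption that the write fits inside one sector are mine:

```c
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512  /* traditional sector size; newer disks use 4096 */

/* Sketch of a sector-granular read-modify-write.  Assumes the write
   fits entirely within one sector.  The kernel normally does the
   equivalent of this for you in its page cache. */
int rmw_write(FILE *fp, long offset, const void *data, size_t len)
{
    unsigned char sector[SECTOR_SIZE];
    long sector_start = offset - (offset % SECTOR_SIZE);

    /* Read the whole sector containing the target offset. */
    fseek(fp, sector_start, SEEK_SET);
    size_t got = fread(sector, 1, SECTOR_SIZE, fp);
    memset(sector + got, 0, SECTOR_SIZE - got);  /* zero-fill a short read */

    /* Modify just the bytes we care about, in memory. */
    memcpy(sector + (offset - sector_start), data, len);

    /* Write the whole sector back. */
    fseek(fp, sector_start, SEEK_SET);
    fwrite(sector, 1, SECTOR_SIZE, fp);
    return fflush(fp);
}
```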

This is what I expect to be happening in unbuffered I/O. I also expect buffered I/O to do roughly the same thing, but be clever about its cache. So I would have thought that if I blow locality out the window by doing random seeks and writes, both buffered and unbuffered I/O ought to have similar performance... maybe with unbuffered coming out slightly better.

Then again, I know it's crazy for buffered I/O to only buffer one sector, so I might also expect it to perform terribly.

My application

I am storing values gathered by a SCADA device driver that receives remote telemetry for upwards of a hundred thousand points. There is extra data in the file such that each record is 40 bytes, but only 20 bytes of that needs to be written during an update.

Pre-implementation benchmark

To check that I don't need to dream up some brilliantly over-engineered solution, I have run a test using a few million random records written to a file that could contain a total of 200,000 records. Each test seeds the random number generator with the same value to be fair. First I erase the file and pad it to the total length (about 7.6 meg), then loop a few million times, passing a random file offset and some data to one of two test functions:

void WriteOldSchool( void *context, long offset, Data *data )
{
    int fd = (int)context;
    lseek( fd, offset, SEEK_SET );
    write( fd, (void*)data, sizeof(Data) );
}

void WriteStandard( void *context, long offset, Data *data )
{
    FILE *fp = (FILE*)context;
    fseek( fp, offset, SEEK_SET );
    fwrite( (void*)data, sizeof(Data), 1, fp );
    fflush(fp);
}
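The benchmark driver itself isn't shown; a minimal sketch of how it might exercise these functions (the names, seed value, and stub writer are my own illustration) could look like this:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { char bytes[20]; } Data;  /* the 20-byte update payload */

typedef void (*WriteFn)(void *context, long offset, Data *data);

static long g_maxOffset = -1;

/* Stub writer for dry runs: records the largest offset it was given. */
static void NullWriter(void *context, long offset, Data *data)
{
    (void)context; (void)data;
    if (offset > g_maxOffset)
        g_maxOffset = offset;
}

/* One benchmark pass: the same seed every run, so each method under
   test sees the identical sequence of random record offsets. */
void RunTest(WriteFn writer, void *context, long numRecords, long numWrites)
{
    Data data = {{0}};
    srand(12345);  /* fixed seed for fairness */
    for (long i = 0; i < numWrites; ++i) {
        long record = rand() % numRecords;
        writer(context, record * 40, &data);  /* 40-byte records on disk */
    }
}
```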

Maybe no surprises?

The OldSchool method came out on top - by a lot. It was over 6 times faster (1.48 million versus 232,000 records per second). To make sure I hadn't run into hardware caching, I expanded my database size to 20 million records (a file size of 763 meg) and got the same results.

Before you point out the obvious call to fflush, let me say that removing it had no effect. I imagine this is because the cache must be committed when I seek sufficiently far away, which is what I'm doing most of the time.

So, what's going on?

It seems to me that the buffered I/O must be reading (and possibly writing all of) a large chunk of the file whenever I try to write. Because I am hardly ever taking advantage of its cache, this is extremely wasteful.

In addition (and I don't know the details of hardware caching on disk), if the buffered I/O is trying to write a bunch of sectors when I change only one, that would reduce the effectiveness of the hardware cache.

Are there any disk experts out there who can comment and explain this better than my experimental findings? =)

leppie
paddy
  • I assume that fseek causes a fflush (thus the lack of difference) because the buffer must be flushed before it can be moved. – dave Nov 01 '12 at 05:21
  • The read and write will be of the block size defined by the file system. On Linux ext4, the default is 4KB. Similarly, the file cache (buffered files) works in PAGE_SIZE units, again most likely 4KB. – Rohan Nov 01 '12 at 05:34
  • You can disable the stdio-level buffering by calling `setbuf(fp, NULL);` after opening `fp`. – caf Nov 01 '12 at 06:45
  • As an aside, sizeof(Data) is the size of the pointer, not the size of the data. – janneb Nov 01 '12 at 07:36
  • @janneb Oops, sorry, this was actually compiled as a C++ project, but since it deals essentially with C functions I labelled it as such. I used C++ syntax for the sizeof. – paddy Nov 01 '12 at 08:38

2 Answers


Indeed, at least on my system with GNU libc, it looks like stdio is reading 4kB blocks before writing back the changed portion. Seems bogus to me, but I imagine somebody thought it was a good idea at the time.

I checked by writing a trivial C program to open a file, write a small amount of data once, and exit; then I ran it under strace to see which syscalls it actually triggered. Writing at an offset of 10000, I saw these syscalls:

lseek(3, 8192, SEEK_SET)                = 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1808) = 1808
write(3, "hello", 5)                    = 5

Seems that you'll want to stick with the low-level Unix-style I/O for this project, eh?
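The test program itself isn't shown in the answer; something equivalent (my reconstruction, not the original code) might be:

```c
#include <stdio.h>
#include <string.h>

/* Open a file with stdio, seek to an unaligned offset, write a few
   bytes, and close.  Running the resulting program under strace
   reveals the read that stdio issues before the write. */
int write_at(const char *path, long offset, const char *msg)
{
    FILE *fp = fopen(path, "r+");
    if (!fp)
        fp = fopen(path, "w+");  /* create the file if it doesn't exist */
    if (!fp)
        return -1;
    fseek(fp, offset, SEEK_SET);  /* e.g. 10000: not 4kB-aligned */
    fwrite(msg, 1, strlen(msg), fp);
    return fclose(fp);
}
```

Wrapped in a main and run under strace, a call like `write_at("testfile", 10000, "hello")` should reproduce a trace like the one above.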

Jamey Sharp
  • 4kB is probably the block size of your filesystem (it's the default in ext4, I think). You could change that if you want. – dave Nov 01 '12 at 05:20
  • @dave, it is true that the filesystem block size was 4kB in this test--and so is my memory page size--but I didn't ask libc to read at all. In this case, there isn't any reason to do so. Using the raw `lseek`/`write` syscalls works fine without extra kernel-user memory copies. – Jamey Sharp Nov 01 '12 at 05:27
  • I have a theory why it's doing that read: since the block size of the fs is 4kB the only way to guarantee that another process hasn't modified the file is to read it in and then write it back out again. Using write bypasses this check. – dave Nov 01 '12 at 05:33
  • @dave: Huh, that's an interesting theory. I don't get it though: read-modify-write cycles introduce _more_ data-races, not fewer. The underlying syscalls provide a certain degree of atomicity that only the kernel can offer. – Jamey Sharp Nov 01 '12 at 05:37
  • This is really cool, thanks for doing that. So could it be that during disk access the write head can be switched on and off partway through a block? *ie* read 1808 bytes, then flick into write mode to dump 5 `hello` bytes while those bytes are underneath the head, then ignore the rest? Or is this still at a high(ish)-level, where the generated machine code would actually read 1808 bytes and then write out 1813 bytes? – paddy Nov 01 '12 at 08:56
  • @paddy: Oh, this is totally all discussing what happens in userspace. The kernel still has to read the whole block off disk, modify it with your new writes, and (sometime later) write the changes back. However, the kernel can do that efficiently and safely, while userspace generally can't. – Jamey Sharp Nov 01 '12 at 19:43

The C standard library functions perform additional buffering, and are generally optimized for streaming reads rather than random IO. On my system, I don't observe the spurious reads that Jamey Sharp saw; I only see spurious reads when the offset is not aligned to a page size. It could be that the C library always tries to keep its IO buffer aligned to 4kB or something.

In your case, if you're doing lots of random reads and writes across a reasonably small dataset, you'd likely be best served by using pread/pwrite to avoid the seeking syscalls, or by simply mmap-ing the dataset and writing to it in memory (likely the fastest option, if your dataset fits in memory).
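On a POSIX system, the pread/pwrite route might look like this sketch (the function name is mine, the 40-byte record and 20-byte payload follow the question; on Windows the rough equivalent is WriteFile with an OVERLAPPED offset):

```c
#include <fcntl.h>
#include <unistd.h>

/* Update the 20 writable bytes of one 40-byte record in place.
   pwrite combines the seek and the write into a single syscall and
   doesn't disturb the file descriptor's position. */
ssize_t update_record(int fd, long recordIndex, const void *payload)
{
    off_t offset = (off_t)recordIndex * 40;  /* 40-byte records, per the question */
    return pwrite(fd, payload, 20, offset);
}
```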

bdonlan
  • Try fseeking to an offset that is not a multiple of 4kB or other likely block sizes--that's why I used offset 10,000. I suspect you'll see the same spurious read that I did. – Jamey Sharp Nov 01 '12 at 07:45
  • Thanks, I didn't know about `pread`/`pwrite`. I am actually on a Windows platform (I had tagged as VS-2010, but that was removed - I guess I should have just stated it in my question). I found this (http://stackoverflow.com/questions/766477/are-there-equivalents-to-pread-on-different-platforms) which suggests that async I/O with the `CreateFile` API might be the way to go. Something I've always shied away from. The throughput of `lseek` + `write` is more than adequate for me... with the admission that I did this on a system with a 2-drive RAID-0 configuration. – paddy Nov 01 '12 at 09:01
  • @paddy, async I/O is one of the few places I'm envious of Windows. (Linux is terrible at it; pretty much the only cases that work are the ones Oracle needs.) If `lseek` + `write` is fast enough for you, I'd definitely do that--but AIO is worth considering if you hit a wall. – Jamey Sharp Nov 01 '12 at 19:47
  • @JameySharp I will have a separate thread whose sole purpose is to process a work queue that contains all the messages to be written to disk. In that respect, I'm not sure what advantage asynchronous I/O would provide. I think I'll still do some tests (but maybe after I get the job done) - at the very least it'll force me to learn something new. – paddy Nov 01 '12 at 21:13