
If you call ReadFile once with something like 32 MB as the size, it takes noticeably longer than if you read the equivalent number of bytes with a smaller chunk size, like 32 KB.

Why?

(No, my disk is not busy.)


Edit 1:

Forgot to mention -- I'm doing this with FILE_FLAG_NO_BUFFERING!


Edit 2:

Weird...

I don't have access to my old machine anymore (PATA), but when I tested it there, it took around 2 times as long, sometimes more. On my new machine (SATA), I'm only getting a ~25% difference.

Here's a piece of code to test:

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int main()
{
    HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING /*(redundant)*/, NULL);
    __try
    {
        const size_t chunkSize = 64 * 1024;
        const size_t bufferSize = 32 * 1024 * 1024;
        // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers;
        // VirtualAlloc returns page-aligned memory, which satisfies that.
        void *pBuffer = VirtualAlloc(NULL, bufferSize, MEM_COMMIT, PAGE_READWRITE);

        // Pass 1: one large read covering the whole buffer.
        DWORD start = GetTickCount();
        ULONGLONG totalRead = 0;
        OVERLAPPED overlapped = { 0 };
        DWORD nr = 0;
        ReadFile(hFile, pBuffer, (DWORD)bufferSize, &nr, &overlapped);
        totalRead += nr;
        _tprintf(_T("Large read: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);

        // Pass 2: the same amount of data in chunkSize pieces, advancing
        // the overlapped offset by hand after each read.
        totalRead = 0;
        start = GetTickCount();
        overlapped.Offset = 0;
        for (size_t j = 0; j < bufferSize / chunkSize; j++)
        {
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, (DWORD)chunkSize, &nr, &overlapped);
            totalRead += nr;
            overlapped.Offset += (DWORD)chunkSize;
        }
        _tprintf(_T("Small reads: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);
        fflush(stdout);

        VirtualFree(pBuffer, 0, MEM_RELEASE);
    }
    __finally { CloseHandle(hFile); }

    return 0;
}

Result:

Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes

Any ideas?

user541686
  • Have you tried swapping the order of operations (doing the 8K reads first then the 64M reads)? It might have to do with the layout of the disk. – Larry Osterman May 18 '11 at 13:35
  • @Larry: Yeah, the small ones are still faster when I switch them. – user541686 May 18 '11 at 17:54
  • Does not answer your question, but might have some info that's interesting for you: http://stackoverflow.com/q/5909345/572743 (this is kind of the opposite of what you're doing, but with the same surprising results). – Damon Jun 10 '11 at 19:48
  • I remember seeing the buffer size limit for optimal operation documented long ago, but not the reasons for it. I doubt I'll be able to find that documentation now. – Mark Ransom Jun 10 '11 at 20:13

6 Answers


This is not specific to Windows. I did some tests a while back with the C++ iostream library and found there was an optimum buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests, and I can't remember what the size was :-). As for why, well, there are a lot of possible issues, such as a large buffer causing paging in other applications running at the same time (since the buffer can't be paged).
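The original tests are gone, but a sketch of that kind of buffer-size sweep might look like this (the file name and the range of sizes are only placeholders):

// Sweep the stream buffer size and time a sequential read of "test.bin"
// (a placeholder file). pubsetbuf must be called before open() to take effect.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <vector>

static void timedRead(std::size_t bufSize)
{
    std::vector<char> streamBuf(bufSize);
    std::ifstream in;
    in.rdbuf()->pubsetbuf(streamBuf.data(), (std::streamsize)streamBuf.size());
    in.open("test.bin", std::ios::binary);

    std::vector<char> chunk(64 * 1024);
    auto start = std::chrono::steady_clock::now();
    while (in.read(chunk.data(), (std::streamsize)chunk.size()))
        ;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("buffer %zu: %lld ms\n", bufSize, (long long)ms);
}

int main()
{
    // Try buffer sizes from 4 KB up to 32 MB (arbitrary bounds for the sketch).
    for (std::size_t size = 4 * 1024; size <= 32 * 1024 * 1024; size *= 4)
        timedRead(size);
}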


When you perform the 1024 × 32 KB reads, are you reading into the same memory block over and over, or are you allocating a total of 32 MB to read into as well and filling the entire 32 MB?
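For concreteness, in terms of the loop in the question (same variables), the two cases look like this; the pointer arithmetic in the second form is only illustrative:

// Case 1: every chunk lands at the start of the same buffer
// (this is what the loop in the question does).
for (size_t j = 0; j < bufferSize / chunkSize; j++)
{
    DWORD nr = 0;
    ReadFile(hFile, pBuffer, chunkSize, &nr, &overlapped);
    overlapped.Offset += chunkSize;
}

// Case 2: each chunk fills a different slice, so the entire 32 MB
// allocation is written, just as the single large read does.
for (size_t j = 0; j < bufferSize / chunkSize; j++)
{
    DWORD nr = 0;
    ReadFile(hFile, (char *)pBuffer + j * chunkSize, chunkSize, &nr, &overlapped);
    overlapped.Offset += chunkSize;
}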

If you're reading the smaller reads into the same 32K block of memory, then the time difference is probably simply that Windows doesn't have to scavenge up the additional memory.


Update based on the FILE_FLAG_NO_BUFFERING addition to the question:

I'm not 100% certain, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows will lock the buffer into physical memory so it can allow the device driver to deal with physical addresses (such as to DMA directly into the buffer). It could (I believe) do this by breaking up a large request into smaller requests, but I suspect that Microsoft might have the philosophy that "if you ask for FILE_FLAG_NO_BUFFERING then we assume you know what you're doing and we're not going to get in your way".

Of course locking 32MB all at once instead of 32KB at a time will require more resources. So this would be kind of like my initial guess, but at the physical memory level rather than the virtual memory level.

However, since I don't work for MS and don't have access to the Windows source, I'm going by vague recollection from times when I worked more closely with the Windows kernel and device driver model (so this is more or less speculation).

Michael Burr

Your test includes the time it takes to read in the file metadata, specifically, the mapping of file data to disk. If you close the file handle and re-open it, you should get similar timings for each. I tested this locally to make sure.

The effect is probably more severe with heavy fragmentation, as you have to read in more file-to-disk mappings.

EDIT: To be clear, I ran this change locally, and saw nearly identical times with large and small reads. Reusing the same file handle, I saw similar timings from the original question.
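For reference, a minimal sketch of that change (same sizes as in the question, error handling omitted): each timed pass opens its own volume handle, so the second pass cannot reuse whatever the first pass already pulled in.

#include <tchar.h>
#include <windows.h>
#include <stdio.h>

// Open a fresh volume handle for each timed pass.
static HANDLE OpenVolume()
{
    return CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
}

// Read `total` bytes in chunks of `chunk` bytes on a freshly opened handle,
// and return the elapsed time in milliseconds.
static DWORD TimePass(size_t chunk, size_t total, void *buffer)
{
    HANDLE h = OpenVolume();
    OVERLAPPED ov = { 0 };
    DWORD start = GetTickCount();
    for (size_t offset = 0; offset < total; offset += chunk)
    {
        DWORD nr = 0;
        ov.Offset = (DWORD)offset;
        ReadFile(h, buffer, (DWORD)chunk, &nr, &ov);
    }
    DWORD elapsed = GetTickCount() - start;
    CloseHandle(h);
    return elapsed;
}

int main()
{
    const size_t total = 32 * 1024 * 1024;
    // Page-aligned allocation satisfies FILE_FLAG_NO_BUFFERING's
    // sector alignment requirement.
    void *buffer = VirtualAlloc(NULL, total, MEM_COMMIT, PAGE_READWRITE);

    _tprintf(_T("Large read:  %lu ms\n"), TimePass(total, total, buffer));
    _tprintf(_T("Small reads: %lu ms\n"), TimePass(64 * 1024, total, buffer));

    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}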

MSN
  • What if the files aren't fragmented, though? – user541686 Jun 14 '11 at 17:58
  • @Mehrdad, then it depends on how much latency you have between finishing one read and starting another. It's going to be pretty small though. – MSN Jun 14 '11 at 18:01
  • @MSN: Why is it that when I switched the order, the larger one was still slower? Doesn't that contradict the metadata reason? – user541686 Jun 17 '11 at 01:58
  • @Mehrdad, only if you closed and reopened the handle. The fact that the larger was slower by less than 200 ms is probably due to statistical noise. If it were truly slower, repeating the reads 50 times would reveal that. (In my tests subsequent reads of the same opened file were much faster.) – MSN Jun 17 '11 at 04:28
  • @MSN: I'm confused -- what does closing/reopening the handle have to do with anything? If my code works the same way in the reverse order, doesn't that mean metadata is not an issue? (If not, why?) Also -- 200 ms is **not** statistical "noise" when you're measuring something that takes 800 ms, especially when I get the same result with my code almost every time... – user541686 Jun 17 '11 at 04:46
  • @Mehrdad, Okay, try this: wrap all the code in the __try loop with a while (1) {...}. If the numbers are consistently 1000/800, then you have something weird going on. I will be a bit surprised if they are consistently 1000/800. – MSN Jun 17 '11 at 05:30
  • @MSN: [You might be surprised](http://i.imgur.com/RUdMV.png)... **Edit**: Oops, my bad, I didn't follow what you said closely enough. Let me fix it and I'll post the results... – user541686 Jun 17 '11 at 05:34
  • @MSN: **[I stand corrected.](http://i.imgur.com/XIFI9.png)** :) I still don't understand why, though... why is it not different when I switch them? – user541686 Jun 17 '11 at 05:43
  • @Mehrdad, without a lot of free time and something like ETW or the like, I can only guess. Maybe something to do with fragmentation of the file; if it's heavily fragmented, a single 64 MB read would require lots of roundtrips in a single I/O call, whereas multiple small reads would require less per I/O. But that's just a wild guess. – MSN Jun 17 '11 at 18:27
  • @MSN: Huh, okay; thanks for the answer anyhow, it's very enlightening! :) – user541686 Jun 17 '11 at 18:28
  • @Mehrdad, Yep. Optimizing I/O is sometimes a very enlightening process. Also a fairly frustrating one. – MSN Jun 17 '11 at 19:46

When you use FILE_FLAG_NO_BUFFERING, the operating system does not buffer the I/O, so each call to the read function is a system call that fetches the data from the disk. To read a file of a fixed size, a smaller buffer means more system calls, hence more user-space-to-kernel-space transitions, and a disk I/O is initiated each time. With a larger block size, reading the same amount of data requires fewer system calls, so there are fewer user-to-kernel switches and fewer disk I/Os are initiated. This is why a larger block will generally take less time to read.

Try reading the file only 1 byte at a time, then try 4096-byte blocks, and see the difference.
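A sketch of that experiment (using an ordinary buffered handle, since FILE_FLAG_NO_BUFFERING rejects reads that are not multiples of the sector size; the file name and sizes are placeholders):

#include <tchar.h>
#include <windows.h>
#include <stdio.h>

// Read the first `total` bytes of "test.bin" (placeholder file name),
// `chunk` bytes per ReadFile call, and return the elapsed time.
static DWORD TimeChunkedRead(DWORD chunk, DWORD total)
{
    HANDLE h = CreateFile(_T("test.bin"), GENERIC_READ, FILE_SHARE_READ,
        NULL, OPEN_EXISTING, 0, NULL);
    char buffer[4096];
    DWORD start = GetTickCount();
    for (DWORD done = 0; done < total; )
    {
        DWORD nr = 0;
        // One system call per iteration; with chunk == 1 that is one
        // kernel transition per byte read.
        if (!ReadFile(h, buffer, chunk, &nr, NULL) || nr == 0)
            break;
        done += nr;
    }
    DWORD elapsed = GetTickCount() - start;
    CloseHandle(h);
    return elapsed;
}

int main()
{
    const DWORD total = 1024 * 1024;
    _tprintf(_T("1-byte reads:    %lu ms\n"), TimeChunkedRead(1, total));
    _tprintf(_T("4096-byte reads: %lu ms\n"), TimeChunkedRead(4096, total));
    return 0;
}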

phoxis
  • Isn't my case the exact opposite of what you're describing? – user541686 May 18 '11 at 08:02
  • @Mehrdad: yes, just noticed, sorry about that. I'd suggest trying reads in chunks of the disk buffer size and timing that, then reading chunks of twice the disk buffer size and timing that, and checking what's going on. – phoxis May 18 '11 at 08:05
  • @Mehrdad: approximately how much time does both of them take ? – phoxis May 18 '11 at 08:09
  • @Mehrdad: not possible for me to tell exactly (sorry); it might be a factor of how the read function is implemented, page faults, the maximum block transfer size, or other factors. – phoxis May 18 '11 at 08:30

A possible explanation in my opinion would be command queueing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at low level.

A single large request will of course still necessarily be broken into sub-requests, but those will likely be sent more or less one after another (because the driver needs to lock the pages and will in all likelihood be reluctant to lock several megabytes, lest it hit the quota).

On the other hand, if you throw a dozen or two dozen requests at the driver, it will just forward them to the disk, and the disk can take advantage of NCQ.

Well, that's what I'm thinking might be the reason anyway (this does not explain why the exact same phenomenon happens with buffered reads though, as in the Q that I linked to above).
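For illustration, "throwing a dozen requests at the driver" might look roughly like this with overlapped I/O (the handle must additionally be opened with FILE_FLAG_OVERLAPPED; the queue depth and sizes are made up, and error handling is omitted):

// Issue many 64 KB reads at once and only then wait for them, so the
// driver/disk sees the whole batch and can reorder it (NCQ).
#include <tchar.h>
#include <windows.h>
#include <stdio.h>

int main()
{
    const DWORD chunkSize = 64 * 1024;
    const DWORD queueDepth = 16;   // "a dozen or two" requests in flight

    HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

    // Page-aligned, so every 64 KB slice satisfies the sector alignment rule.
    void *buffer = VirtualAlloc(NULL, chunkSize * queueDepth,
        MEM_COMMIT, PAGE_READWRITE);

    OVERLAPPED ov[queueDepth] = {};
    for (DWORD i = 0; i < queueDepth; i++)
    {
        ov[i].Offset = i * chunkSize;
        ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        // Returns FALSE with ERROR_IO_PENDING while the request is queued.
        ReadFile(hFile, (char *)buffer + i * chunkSize, chunkSize, NULL, &ov[i]);
    }

    ULONGLONG total = 0;
    for (DWORD i = 0; i < queueDepth; i++)
    {
        DWORD nr = 0;
        GetOverlappedResult(hFile, &ov[i], &nr, TRUE);  // wait for completion
        total += nr;
        CloseHandle(ov[i].hEvent);
    }

    _tprintf(_T("Read %llu bytes with %lu requests in flight\n"), total, queueDepth);
    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(hFile);
    return 0;
}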

Damon

What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third read while the second is being processed, and so on, so that the speed limit is the slower of the physical read time and the processing time. If it takes the same amount of time to process one block as to read the next, the speed could be double what it would be if processing and reading were done separately. When using larger blocks, the amount of data that can be read while the first block is being processed is limited to an amount smaller than the block size. When the code is ready for the next block of data, part of it will have been read but some of it will not; the code will thus have to wait while the remainder of the data is fetched.
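A sketch of the read/process overlap described here, using two buffers and overlapped I/O ("test.bin" and ProcessBlock() are placeholders; error handling is omitted):

// Start reading block n+1, process block n while that read is in flight,
// then swap buffers.
#include <tchar.h>
#include <windows.h>
#include <stdio.h>

static void ProcessBlock(const char *data, DWORD size)
{
    // Stand-in for whatever per-block work the application does.
    (void)data; (void)size;
}

// Issue an overlapped read; returns FALSE if it could not be started (e.g. EOF).
static BOOL StartRead(HANDLE h, char *buf, DWORD size, OVERLAPPED *ov)
{
    ResetEvent(ov->hEvent);
    if (ReadFile(h, buf, size, NULL, ov))
        return TRUE;                              // completed synchronously
    return GetLastError() == ERROR_IO_PENDING;    // still in flight
}

// Wait for an overlapped read to finish; returns the byte count (0 at EOF).
static DWORD FinishRead(HANDLE h, OVERLAPPED *ov)
{
    DWORD nr = 0;
    GetOverlappedResult(h, ov, &nr, TRUE);
    return nr;
}

int main()
{
    const DWORD chunkSize = 64 * 1024;
    HANDLE hFile = CreateFile(_T("test.bin"), GENERIC_READ, FILE_SHARE_READ,
        NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    char *buffers[2] = { new char[chunkSize], new char[chunkSize] };
    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    // Prime the pipeline with block 0.
    DWORD got = StartRead(hFile, buffers[0], chunkSize, &ov)
        ? FinishRead(hFile, &ov) : 0;

    for (int cur = 0; got != 0; cur ^= 1)
    {
        DWORD current = got;

        // Kick off the read of the next block into the other buffer...
        ov.Offset += chunkSize;
        BOOL pending = StartRead(hFile, buffers[cur ^ 1], chunkSize, &ov);

        // ...and process the current block while that read is in flight.
        ProcessBlock(buffers[cur], current);

        got = pending ? FinishRead(hFile, &ov) : 0;
    }

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    delete[] buffers[0];
    delete[] buffers[1];
    return 0;
}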

supercat