5

I used the code below to measure the performance difference between large sequential reads of a memory-mapped file and the equivalent reads through ReadFile:

HANDLE hFile = CreateFile(_T("D:\\LARGE_ENOUGH_FILE"),
    FILE_READ_DATA, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
    FILE_FLAG_NO_BUFFERING, NULL);
__try
{
    const size_t TO_READ = 32 * 1024 * 1024;
    char sum = 0;
#if TEST_READ_FILE
    DWORD start = GetTickCount();
    char* p = (char*)malloc(TO_READ);
    DWORD nw;
    ReadFile(hFile, p, TO_READ, &nw, NULL);
#else
    HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READONLY,
        0, 0, NULL);
    const char* const p = (const char*)MapViewOfFile(hMapping,
        FILE_MAP_READ, 0, 0, 0);
    DWORD start = GetTickCount();
#endif
    for (size_t i = 0; i < TO_READ; i++)
    {
        sum += p[i]; // Do something kind of trivial...
    }
    DWORD end = GetTickCount();
    _tprintf(_T("Elapsed: %u"), end - start);
}
__finally { CloseHandle(hFile); }

(I just changed the value of TEST_READ_FILE to change the test.)

To my surprise, ReadFile was slower by ~20%! Why?

user541686
  • Sure you are not just watching the disk cache at work? – Thilo Mar 10 '11 at 08:20
  • Not really... how can I be sure? I've turned off the file system cache but I can't do much about the disk cache... – user541686 Mar 10 '11 at 08:21
  • reboot. try the test in the opposite order. and many times. – Thilo Mar 10 '11 at 08:25
  • @Thilo: There's no "order" to the tests -- notice that only *one* of them happens at every run, and I alternated the runs. How would rebooting change anything? – user541686 Mar 10 '11 at 08:28
  • Rebooting (as in power-cycle) should clear the cache. By order I mean the order in which you run the two test programs. – Thilo Mar 10 '11 at 08:31
  • @Thilo: I understood what you meant by rebooting, but why would that change anything? The disk cache is below *both* kinds of reads, so it wouldn't matter. And like I said, regarding the order, there *is no order* -- I alternated setting `TEST_READ_FILE` to `0` and to `1` every run, so there's no ordering here. – user541686 Mar 10 '11 at 08:35
  • It might be worth it to run each block on many different files of the same size to try and avoid caching issues, as well. But I would imagine since `ReadFile` uses an internal buffer that it then copies to the supplied pointer while the memory mapped file method should only need one memory write, the extra memory access is the cause of the performance difference, especially so if `p` and `ReadFile`'s buffers' addresses cause thrashing in the processor caches. – jswolf19 Mar 10 '11 at 08:58
  • @Mehrdad: Of course there's an order -- only one of the tests can be run on a cold cache, because running the test warms the cache. – Ben Voigt Mar 10 '11 at 13:13
  • @Ben: So if there's an order, then what's the order? Which one's running "first"? @jswolf19: That sounds reasonable -- it does seem like a CPU caching issue... – user541686 Mar 10 '11 at 14:20
  • @Mehrdad: After boot, did you first run the version with `TEST_READ_FILE`? @jswolf: You'd only get thrashing with a non-associative cache, most CPUs these days are set-associative. – Ben Voigt Mar 10 '11 at 15:26
  • @Ben: I just rebooted and tried it the other way; no difference. It seems like the loop for adding up the values is what causes the difference -- when I take it out from the `ReadFile` version, the two speeds match. – user541686 Mar 10 '11 at 17:44
  • @Mehrdad: For the memory-mapped file, `MapViewOfFile` returns immediately and you can start processing the data after only a single page is filled -- subsequent reads happen in parallel with processing, even though `FILE_FLAG_OVERLAPPED` was not specified. Conversely, `ReadFile` doesn't return until the entire file is read. Using `ReadFile` on an asynchronous file handle (opened with `FILE_FLAG_OVERLAPPED`) should give similar results to `MapViewOfFile`. The recommendation for a single sequential scan through a huge file (that can't fit in cache) is to use unbuffered overlapped I/O. – Ben Voigt Mar 10 '11 at 18:29
  • @Ben: Something that's bothered me about using overlapped I/O here: What's the point? I can't possibly know how much data is already read into the buffer, so I'd have to wait until the reading is done anyway, right? – user541686 Mar 10 '11 at 18:41
  • @Mehrdad: You issue a bunch of smaller requests. Since the OS gets all the requests up front, it can optimize the read order (in case the file is fragmented or part is already in cache) but each piece will complete individually. Say 8 requests for 4 MB each is not increasing the amount of control data by much but will give 87% parallelization. – Ben Voigt Mar 11 '11 at 00:38
  • @Ben: Ah, I see... so you'd only give the request in smaller chunks, not a single large chunk. – user541686 Mar 11 '11 at 01:24
  • @Mehrdad: Yes, because although you still don't know how much data is in the incomplete buffers, you are told which of the smaller chunks have completed. – Ben Voigt Mar 11 '11 at 01:29
  • @Mehrdad: How is the timing fair? You time malloc() in ReadFile case, but time nothing in memory mapped case. I understand that the calls to CreateFileMapping() and MapViewOfFile() might be cheap, but it still looks a bit awkward the way you have it. Or is there some sense in measuring the times like this? – the swine Nov 11 '13 at 14:40
  • @theswine: It's a really good point, unfortunately it's been 2 years since this question and I don't remember much about it (I don't have the same computer to test it on, either). I think you're probably right though -- `malloc` skews the results dramatically because touching every page to force pages to be committed slows things down. Whether that's fair or not, I'm not sure... I could argue either way. (For example, since file mappings don't require malloc, you could argue it's an extra burden you have to account for.) – user541686 Nov 11 '13 at 15:25
  • @Mehrdad: I see. Never mind. What I meant was just moving the start time above the #ifdef (so there is only one). That seems fair to me :). But don't worry about it, I can try it on my computer sometime when I'm bored. – the swine Nov 11 '13 at 18:08
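The chunked overlapped-read scheme Ben describes in the last few comments might be sketched roughly as follows. This is an untested illustration, not code from the thread: error handling is abbreviated, the chunk count of 8 follows the example figure in the comments, and note that adding `FILE_FLAG_NO_BUFFERING` would additionally require sector-aligned buffers, offsets, and read sizes.

```cpp
#include <windows.h>
#include <tchar.h>
#include <stdlib.h>

int _tmain()
{
    HANDLE hFile = CreateFile(_T("D:\\LARGE_ENOUGH_FILE"),
        FILE_READ_DATA, FILE_SHARE_READ, NULL, OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED, NULL); // asynchronous handle
    if (hFile == INVALID_HANDLE_VALUE)
        return 1;

    const size_t TO_READ = 32 * 1024 * 1024;
    const size_t CHUNKS = 8;                          // e.g. 8 x 4 MB
    const DWORD CHUNK_SIZE = (DWORD)(TO_READ / CHUNKS);
    char* p = (char*)malloc(TO_READ);

    OVERLAPPED ov[CHUNKS] = {};
    HANDLE events[CHUNKS];

    // Issue every request up front so the OS sees them all at once
    // and can optimize the read order.
    for (size_t i = 0; i < CHUNKS; i++)
    {
        ULONGLONG offset = (ULONGLONG)i * CHUNK_SIZE;
        events[i] = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[i].hEvent = events[i];
        ov[i].Offset = (DWORD)(offset & 0xFFFFFFFF);
        ov[i].OffsetHigh = (DWORD)(offset >> 32);
        ReadFile(hFile, p + offset, CHUNK_SIZE, NULL, &ov[i]);
        // Expected to return FALSE with GetLastError() == ERROR_IO_PENDING.
    }

    // Consume each chunk as soon as it completes; later chunks keep
    // filling in the background while earlier ones are processed.
    char sum = 0;
    for (size_t i = 0; i < CHUNKS; i++)
    {
        DWORD nw;
        GetOverlappedResult(hFile, &ov[i], &nw, TRUE); // wait for chunk i
        for (DWORD j = 0; j < nw; j++)
            sum += p[i * CHUNK_SIZE + j];
        CloseHandle(events[i]);
    }

    _tprintf(_T("sum = %d\n"), sum);
    free(p);
    CloseHandle(hFile);
    return 0;
}
```

You still don't know how much data is in an incomplete buffer, but as Ben notes, you do know which whole chunks have completed, which is what makes the overlap between I/O and processing possible.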

1 Answer

7

FILE_FLAG_NO_BUFFERING cripples ReadFile. The memory-mapped file is free to use whatever read-ahead algorithm it wants, while you've forbidden ReadFile from doing the same. You've turned off caching only in the ReadFile version; memory-mapped files can't work without the file cache.
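For an apples-to-apples comparison, the ReadFile side of the test would leave the cache on and hint sequential access, so the cache manager performs the same read-ahead the mapped view already gets. A minimal sketch of how the open call in the question might change (the flag is suggested in a comment below; only the flags differ from the original code):

```cpp
// Cache enabled, with a read-ahead hint -- note FILE_FLAG_NO_BUFFERING
// is gone and FILE_FLAG_SEQUENTIAL_SCAN is used instead.
HANDLE hFile = CreateFile(_T("D:\\LARGE_ENOUGH_FILE"),
    FILE_READ_DATA, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
    FILE_FLAG_SEQUENTIAL_SCAN, NULL);
```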

Ben Voigt
  • @Ben: Wait, why wouldn't the memory-mapped file respect the flag? – user541686 Mar 10 '11 at 17:44
  • @Mehrdad: How could it? MapViewOfFile maps pages from the file cache into your process memory space using the MMU. – Ben Voigt Mar 10 '11 at 17:49
  • @Ben: I had no idea how it worked, but I imagined that it caused a page fault whenever a page was accessed, so the kernel could take control and fill the page with the data -- so there didn't actually have to be any caching. – user541686 Mar 10 '11 at 17:51
  • You should also use `FILE_FLAG_SEQUENTIAL_SCAN` for large sequential reads. – Ben Voigt Mar 10 '11 at 17:53
  • @Mehrdad: It only does that the **first** time a page is accessed. Re-reading each page 4000 times would be crazy. And it's the cache manager which fills that page with data and decides when to free it... this is how memory-mapped files stay consistent between processes, because the same physical page is mapped by the cache manager into multiple processes. Your process doesn't own the memory. – Ben Voigt Mar 10 '11 at 17:56
  • @Mehrdad: Interesting, and a nice question. Anyway, in your case, did FILE_FLAG_NO_BUFFERING get the 20% back so that ReadFile() would be as fast as memory mapped? What was the final outcome in your particular case? – the swine Nov 11 '13 at 14:35