
(For the purposes of this question, I'm disregarding file-copying APIs such as CopyFile, etc.)

I'm trying to answer the question: if I need to copy many large files, which method is fastest?

I can think of four basic methods for copying a file:

  1. ReadFile + WriteFile
  2. MapViewOfFile the source into memory, and WriteFile the buffer to the destination
  3. MapViewOfFile the destination into memory, and ReadFile the source into the buffer
  4. MapViewOfFile both files, and memcpy from one file to another

Furthermore, in each case, there are also some options I can set, such as FILE_FLAG_NO_BUFFERING and SEC_LARGE_PAGES.

However, I don't know how to properly benchmark this. I've written the following code:

#include <stdio.h>
#include <time.h>
#include <tchar.h>
#include <Windows.h>

void MyCopyFile(HANDLE source, HANDLE sink, bool mapsource, bool mapsink)
{
    LARGE_INTEGER size = { 0 };
    GetFileSizeEx(source, &size);
    HANDLE msource = mapsource ? CreateFileMapping(source, NULL, PAGE_READONLY, 0, 0, NULL) : NULL;
    HANDLE msink = mapsink ? CreateFileMapping(sink, NULL, PAGE_READWRITE, size.HighPart, size.LowPart, NULL) : NULL;
    void const *const psource = mapsource ? MapViewOfFile(msource, FILE_MAP_READ, 0, 0, size.QuadPart) : NULL;
    void *const psink = mapsink ? MapViewOfFile(msink, FILE_MAP_WRITE, 0, 0, size.QuadPart) : NULL;
    clock_t const start = clock();
    unsigned long nw = 0;
    if (mapsource)
    {
        if (mapsink)
        {
            memcpy(psink, psource, size.QuadPart);
            nw = size.QuadPart;
        }
        else
        { WriteFile(sink, psource, size.QuadPart, &nw, NULL); }
    }
    else
    {
        if (mapsink)
        { ReadFile(source, psink, size.QuadPart, &nw, NULL); }
        else
        {
            // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers, so use a page-aligned VirtualAlloc allocation rather than malloc
            void *const buf = VirtualAlloc(NULL, (SIZE_T)size.QuadPart, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            if (!ReadFile(source, buf, size.QuadPart, &nw, NULL)) { fprintf(stderr, "Error reading from file: %u\n", GetLastError()); }
            if (!WriteFile(sink, buf, size.QuadPart, &nw, NULL)) { fprintf(stderr, "Error writing to file: %u\n", GetLastError()); }
            VirtualFree(buf, 0, MEM_RELEASE);
        }
    }
    if (mapsink) { FlushViewOfFile(psink, size.QuadPart); }  // only flush when the destination was actually mapped
    clock_t const end = clock();
    if (mapsource) { UnmapViewOfFile(psource); }
    if (mapsink) { UnmapViewOfFile(psink); }
    if (mapsource) { CloseHandle(msource); }
    if (mapsink) { CloseHandle(msink); }
    if (nw) { fprintf(stderr, "(%d, %d): %u MiB/s\n", mapsource, mapsink, (unsigned int)(size.QuadPart * CLOCKS_PER_SEC / (((long long)(end - start) << 20) + 1))); }
}
int main()
{
    // Request permission to extend file without zeroing, for faster performance
    {
        enum TokenPrivilege { SeManageVolumePrivilege = 28 };
        typedef NTSTATUS NTAPI PRtlAdjustPrivilege(IN TokenPrivilege Privilege, IN BOOLEAN Enable, IN BOOLEAN Client, OUT PBOOLEAN WasEnabled);
        static PRtlAdjustPrivilege &RtlAdjustPrivilege = *(PRtlAdjustPrivilege *)(GetProcAddress(GetModuleHandle(_T("ntdll.dll")), _CRT_STRINGIZE(RtlAdjustPrivilege)));
        BOOLEAN old; RtlAdjustPrivilege(SeManageVolumePrivilege, TRUE, FALSE, &old);
    }
    for (int i = 0;; i++)
    {
        HANDLE source = CreateFile(_T("TempSource.bin"), FILE_READ_DATA | FILE_WRITE_DATA | SYNCHRONIZE, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_ALWAYS, FILE_FLAG_DELETE_ON_CLOSE | FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        HANDLE sink = CreateFile(_T("TempSink.bin"), FILE_READ_DATA | FILE_WRITE_DATA | SYNCHRONIZE, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL, OPEN_ALWAYS, FILE_FLAG_DELETE_ON_CLOSE | FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        LARGE_INTEGER size; size.QuadPart = 1 << 26;
        LARGE_INTEGER zero = { 0 };
        SetFilePointerEx(source, size, &size, FILE_BEGIN);
        SetEndOfFile(source);
        SetFileValidData(source, size.QuadPart);
        SetFilePointerEx(source, zero, &zero, FILE_BEGIN);
        SetFilePointerEx(sink, zero, &zero, FILE_BEGIN);
        MyCopyFile(source, sink, i % 2 != 0, i / 2 % 2 != 0);
        FlushFileBuffers(source);
        FlushFileBuffers(sink);
        if ((i % 4) + 1 == 4) { fprintf(stderr, "\n"); }
        CloseHandle(source);
        CloseHandle(sink);
    }
}

Unfortunately, my code gives me wildly different results on the first iteration than on subsequent iterations, so I'm having a hard time figuring out how to benchmark this operation.

Which method should be fastest, and how do I properly benchmark my system to confirm this?

user541686
  • File copy speeds are a bit of a black art. It seems to mostly come down to the buffer size you use. Theoretically the fastest method should be using overlapped I/O which rules out memory mapping. Memory mapping would also mean on a 32-bit system you couldn't copy files greater than about 3GB. – Jonathan Potter Feb 15 '15 at 21:26
  • Don't forget also that the filesystem will cache (some of) the data you read, so you can't get reliable results without flushing the cache somehow (i.e. by rebooting) in between tests. – Jonathan Potter Feb 15 '15 at 21:28
  • @JonathanPotter: Yeah I'm just testing this on a 64-bit system for now. Not sure why overlapped I/O rules out memory mapping though... can't I just read or write to a memory-mapped file in an overlapped fashion? As for flushing the cache, I do have `FILE_FLAG_NO_BUFFERING` set, so that should be enough, right? Or would that give me unrealistic speeds? – user541686 Feb 15 '15 at 21:55
  • With overlapped I/O you can start a write going and then go and read some more data. With memory mapping, your memcpy call doesn't return until it's actually finished. Maybe you could get something working with multiple threads? Would be worth a try I suppose. – Jonathan Potter Feb 15 '15 at 22:15
  • @JonathanPotter: Oh, yeah I realize memcpy is probably slow, I just put it there for completeness. But 2 of the other 4 methods are also based on memory-mapping, and those don't use memcpy. – user541686 Feb 15 '15 at 22:15
  • The hardware cache might affect this too, so `FILE_FLAG_NO_BUFFERING` probably isn't sufficient. (And probably doesn't affect memory mapping anyway.) I think the only way to make a benchmark valid is to power cycle the system each time - and even then you probably need to run multiple iterations and make sure they're consistent. – Harry Johnston Feb 16 '15 at 04:33
  • As to which *should* be fastest - for large files I think they *should* all be pretty much the same, as determined by the actual speed of the hardware. Even top-end PCI SSDs are still significantly slower than RAM, as I understand things. – Harry Johnston Feb 16 '15 at 04:38
  • Overlapped I/O, although it sounds cool in theory and on paper, will almost certainly not be the fastest possible way, as it is an ill-advised approach. For large files, as Harry Johnston points out, all methods _should_ be the same, but for small and medium sizes, either file mapping or standard I/O will usually be faster, sometimes by orders of magnitude. – Damon Feb 17 '15 at 21:48
  • The fastest file copy routine I know of is implemented in the [FastCopy](http://ipmsg.org/tools/fastcopy.html.en) utility. Source is available. As I recall, he uses `ReadFile` / `WriteFile` without buffering, etc. It's unlikely that mapping the file would be faster than `ReadFile` / `WriteFile`, because the file mapping has to do essentially what `ReadFile` does to read the file, and you're adding another API layer on top of that. – Jim Mischel Feb 17 '15 at 22:23
  • @JimMischel: I imagine file mapping could be the same speed at best if the file is not already buffered, but if the file is already buffered, I would expect file mapping to be faster because it would avoid a memcpy from the kernel buffer to the user buffer. So that's why I'm testing it -- I would expect it should be faster in some cases, and similar in the worst case. – user541686 Feb 17 '15 at 22:32
  • I just thought of another variable: a copy on a single disk probably behaves differently to a copy between disks. In particular, in the latter scenario you need to read from one disk and write to the other *at the same time* to get the best performance. Synchronous ReadFile/WriteFile with buffering disabled would definitely be affected by that, not sure how much buffering would mitigate it. – Harry Johnston Feb 18 '15 at 03:11
  • @Damon: what specifically is wrong with overlapped I/O? As I understand it, most or all of the synchronous I/O functions actually use overlapped I/O under the hood, so it seems unlikely that they inherently perform better. – Harry Johnston Feb 18 '15 at 03:13
  • @HarryJohnston: The "no cache" bit is wrong. Overlapped I/O only works with disabled buffer cache (which in fact throws away the buffered pages for all processes). That's fine for read-once operations on huge data that doesn't fit RAM anyway, but it is a mighty stupid thing otherwise. Also, overlapped has many secret "turn into synchronous" situations apart from the well-documented ones. Memory mapping is much more friendly there, and very fast (and thanks to lazy writeback you even get speeds that are physically impossible, such as writing a DVD ISO onto iSCSI in under half a second). – Damon Feb 18 '15 at 09:47
  • @Damon: "overlapped I/O only works with disabled buffer cache" ... do you have a reference for that? – Harry Johnston Feb 18 '15 at 09:57
  • @HarryJohnston: Try it, and actually look at the return codes (don't trust that it "works" because something was read). Although it will _appear to work_, it secretly reverts to running synchronously. So while it pretends to work, it really doesn't. Similar stuff happens if you exceed some unknown magical number of requests or some magical unknown size. There's a bit of that in my 4-year old question [here](http://stackoverflow.com/q/5909345/572743). Note how _queueing_ a request takes 12ms. The summary is that overlapped is surprising at best and sucks otherwise. – Damon Feb 18 '15 at 11:35
  • @HarryJohnston: For an "overlapped" or "aio" system to truly "work", the expectation is that you queue a request which takes near zero time (maybe a few hundred cycles) and at some point in time later you get notified or are able to query/synchronize, and your request is done, at the maximum possible speed. The true, observable behavior is much different (not just on Windows but also on Linux btw). Buffer cache and asynchronous are enemies (for no apparent reason?), and strange things happen (like queueing a request taking tens of milliseconds) when you least expect them. – Damon Feb 18 '15 at 11:39
  • @Damon: I'm not entirely convinced that using synchronous I/O with multiple threads would necessarily do any better, but it would obviously be simpler and might not do much worse. I still prefer async I/O to multiple threads just because makes the code simpler to reason about, but I guess I've just never run across your problem cases. – Harry Johnston Feb 18 '15 at 19:19
  • Damon is right, and I can't believe my eyes. I just tried to write a counterexample and instead it turned into a proof. @HarryJohnston, try running [this](https://pastebin.com/cuHyiRKp) on your machine: give the program the names of some files >= 64 MiB and see what happens as it tries to read them with buffering enabled (change `flags` to disable buffering). A simplified sketch of the same idea appears after these comments. – user541686 Feb 18 '15 at 20:00
  • @Mehrdad: that *is* odd, yes. Curiously, when I tried this on my last Windows 2003 server, the I/O simply failed, error 1450, insufficient system resources. It also works as we might have expected in some circumstances, e.g., a network drive. – Harry Johnston Feb 18 '15 at 20:25
  • @HarryJohnston: Oh, the failure is probably because the read size is too big; I remember Windows XP didn't work well with reads >= 32 MiB. Try reducing it in the code to (say) 8 or 16 MiB instead of 64 MiB. – user541686 Feb 18 '15 at 20:49
  • Splitting the read into multiple smaller blocks doesn't seem to help either, except on the Windows 2003 server, where it then works as expected; i.e., the queuing latency is much lower than the read time, unless the data is in the cache, in which case the I/O completes synchronously (but quickly). OK, I concede; asynchronous I/O on files is indeed pretty darn broken. (It doesn't seem to be any slower than synchronous I/O, though, so it may still be useful in scenarios that can benefit from I/O that *might* or *might not* be asynchronous!) – Harry Johnston Feb 18 '15 at 20:49
  • @HarryJohnston: Interesting. I actually think I've observed async I/O perform slower than synchronous I/O before (probably because it's under less of an obligation to return to the caller ASAP), but I'm not sure I have the time to code it now... though if I have a few minutes I'll try to code that too. – user541686 Feb 18 '15 at 20:51
  • @Mehrdad: yeah, some of the older documentation seems to imply that, though I have a sneaking suspicion that they may have fixed that at the same time that they broke everything else! But I didn't do exhaustive testing, my results might not be typical. – Harry Johnston Feb 18 '15 at 20:55
  • I finally noticed that the documentation for `FILE_FLAG_NO_BUFFERING` subtly mentions that cached behavior is always synchronous, since it says this flag *"gives maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager"*. – user541686 Mar 08 '15 at 02:02
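
Following up on the overlapped-I/O discussion above, here is a minimal sketch of that kind of test (not the pastebin program, just an illustration): it queues a single overlapped read and reports how long the ReadFile call itself took and whether the call completed synchronously or actually went asynchronous. The 8 MiB chunk size and the flag choice are just assumptions to experiment with.

#include <windows.h>
#include <stdio.h>
#include <tchar.h>

// Queue one overlapped read and report whether it really went asynchronous,
// and how long the ReadFile call itself (i.e. the queueing) took.
int _tmain(int argc, TCHAR *argv[])
{
    if (argc < 2) { _ftprintf(stderr, _T("usage: %s <file>\n"), argv[0]); return 1; }

    DWORD const flags = FILE_FLAG_OVERLAPPED;  // add FILE_FLAG_NO_BUFFERING to compare
    HANDLE file = CreateFile(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, flags, NULL);
    if (file == INVALID_HANDLE_VALUE) { _ftprintf(stderr, _T("open failed: %u\n"), GetLastError()); return 1; }

    SIZE_T const chunk = 8 << 20;  // 8 MiB
    void *const buf = VirtualAlloc(NULL, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    BOOL const ok = ReadFile(file, buf, (DWORD)chunk, NULL, &ov);
    DWORD const err = ok ? ERROR_SUCCESS : GetLastError();
    QueryPerformanceCounter(&t1);

    _ftprintf(stderr, _T("queueing took %.3f ms: %s\n"),
              (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart,
              ok ? _T("completed synchronously")
                 : (err == ERROR_IO_PENDING ? _T("went asynchronous") : _T("failed")));

    DWORD nr = 0;
    GetOverlappedResult(file, &ov, &nr, TRUE);  // wait for the read to finish either way

    CloseHandle(ov.hEvent);
    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(file);
    return 0;
}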

1 Answer


I think it depends on how serious your performance tracking needs to be. You can use timers and measure how long things take on your own, or you can use ETW (Event Tracing for Windows). ETW is how Windows itself does high-performance event logging and tracing, and it is what serious system performance tracing is built on. Once your component is hooked up correctly with ETW, you can trace performance all the way through the system, from your user-mode components down into kernel mode. I am not really a Visual Studio user, but I believe there are tools that automatically add profiling to your components; I've used tools like F1 and TDD, which may have been integrated into VS at some point.

There are also some powerful tools for deep-diving into the ETL file that a performance trace produces. Are you interested in heap fragmentation, or CPU time for a given stack? With proper performance tracing you can measure basically any dimension of your (or the system's) performance.

One of the basic concepts is the activity ID: a GUID set in thread-local storage that stitches related events together into a scenario.

Start by capturing performance traces and figuring out how to decode the ETL files. Then add activity IDs to your code, decode the ETL file, and measure the performance of your scenario.

Anyway, this is how serious system performance tracing is done. Hopefully this is a helpful starting place.

If you don't need to be that serious, then just use timers in your code.
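
For the timer route, QueryPerformanceCounter gives much better resolution than clock(); here is a minimal sketch (the copy callback is just a placeholder for whichever copy routine you want to measure):

#include <windows.h>

typedef void (*CopyOperation)(HANDLE source, HANDLE sink);

// Returns the wall-clock time of one copy operation, in seconds.
double TimeCopySeconds(CopyOperation copy, HANDLE source, HANDLE sink)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);  // counts per second
    QueryPerformanceCounter(&start);
    copy(source, sink);
    QueryPerformanceCounter(&end);
    return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

For the ETW route, here is a helper that gets or creates an activity ID on the current thread: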

void GetSetActivityId(
    _Out_ GUID *pActivityId)
{
    GUID activityId = {0};
    GUID guidNull = {0};

    TRACE_FUNCTION_ENTRY(LEVEL_INFO);

    // Check to see if there is already an activity ID on the thread.
    (void)EventActivityIdControl(EVENT_ACTIVITY_CTRL_GET_ID, &activityId);

    if (RtlCompareMemory(&guidNull, &activityId, sizeof(GUID)) == sizeof(GUID)) {
        // We didn't get one from the thread, so create and set an activity ID.
        if (EventActivityIdControl(EVENT_ACTIVITY_CTRL_CREATE_ID, &activityId) == ERROR_SUCCESS) {
            (void)EventActivityIdControl(EVENT_ACTIVITY_CTRL_SET_ID, &activityId);
        }
    }

    TRACE_FUNCTION_EXIT(LEVEL_COND);

    *pActivityId = activityId;
}  // GetSetActivityId
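
And a sketch of how you might use it around a copy (the provider GUID, the event strings, and the TracedCopy name are made up for illustration; EventRegister / EventWriteString / EventUnregister come from <evntprov.h>, linked against Advapi32):

#include <windows.h>
#include <evntprov.h>

// Hypothetical provider GUID -- generate your own (e.g. with uuidgen).
static const GUID MyCopyBenchProvider =
    { 0x01234567, 0x89ab, 0xcdef, { 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef } };

void TracedCopy(void)
{
    REGHANDLE provider = 0;
    if (EventRegister(&MyCopyBenchProvider, NULL, NULL, &provider) != ERROR_SUCCESS) { return; }

    GUID activityId;
    GetSetActivityId(&activityId);  // stamp the thread so the start/stop events correlate

    EventWriteString(provider, 4 /* TRACE_LEVEL_INFORMATION */, 0, L"CopyStart");
    /* ... perform the copy you want to measure ... */
    EventWriteString(provider, 4 /* TRACE_LEVEL_INFORMATION */, 0, L"CopyStop");

    EventUnregister(provider);
}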
sam msft