1

I'm trying to optimize an I/O-bound C++ Win32 application. What it does is essentially recurse through a folder and compute a cryptographic hash of every file it finds. It's a single-threaded application using memory-mapped files, so as you can imagine it doesn't seem to use much CPU: most of the time the main thread is put to sleep waiting for I/O to complete. I'm considering a couple of solutions, but I'm not sure about them, so I'd like to have your opinions.

  1. I could spawn many threads (with a fixed-size pool of workers to keep memory usage under a certain threshold), but honestly I don't know if this would make the situation better: each thread is going to be put to sleep just like my main thread in the current implementation, plus the scheduler would waste a lot of CPU time switching contexts.
  2. I was thinking about I/O completion ports (single thread? multiple?), but that would mean abandoning memory-mapped files (am I wrong?) and using standard file operations. If that's the case, could you please provide some example code on how to use IOCP to read and process a given list of files without putting the reading thread to sleep?

Any other idea/suggestion/etc would be really appreciated :)

Thanks.

Simone Margaritelli
  • You might want to first verify your assumptions regarding performance, by using Perfmon, Task Manager, or any other profiler. – Chris O Dec 05 '13 at 00:52
  • those are not assumptions, I've already tested the app with these kinds of tools :) – Simone Margaritelli Dec 05 '13 at 00:56
  • Ah very good then, my only advice now is to try different techniques for reading from the file system, in the hopes that some other API might be better at it. Try things from CRT, C++ libs, and win32 API, maybe you'll get lucky. Though if the process is IO bound, then using extra threads won't really help at all. – Chris O Dec 05 '13 at 01:08
  • I'd lean towards the first option, IOCP seems like overkill for what you're looking to do. What's the breakdown in time (percentage-wise) for reading the contents of the file versus computing the hash? – Mike O'Connor Dec 05 '13 at 01:16
  • @SimoneMargaritelli Have you verified that you're not already saturating the read bandwidth of the storage device? – MooseBoys Dec 05 '13 at 01:16
  • It doesn't sound like there is much to gain in your case, but with one thread for reading and one thread for computing the hash you can make the two operations concurrent, especially on a multiprocessor machine. – ScottMcP-MVP Dec 05 '13 at 02:25
  • @MooseBoys yep, the I/O is very low (some KB/s), and CPU usage too (around 1-2% max), which is why I'm trying to optimize this to use more system resources. – Simone Margaritelli Dec 05 '13 at 13:41
  • @SimoneMargaritelli I've done this before in C# and even using non-blocking IO etc, I found that the filesystem is the blocker in most cases. Because you're in a multiprocess environment the filesystem has to set several very expensive mutexes on each file which requires (briefly) acquiring a global mutex. As such the code to open a file is horrendously expensive, as is the code to close it. – Mgetz Dec 05 '13 at 14:06

5 Answers

1

My experience indicates that memory mapping isn't particularly fast, so that would probably be the first thing I'd abandon.

Threading (explicitly or via IOCPs) probably won't do much good either, unless the target system has lots of disk drives, and can split things up so different threads are operating on different physical drives.

Once you've given up on memory mapping and do explicit I/O, you probably want to use FILE_FLAG_NO_BUFFERING and read relatively large blocks (say, a few megabytes at a time). Do check the alignment requirements on your block of memory though--they're a little tricky (or maybe "tedious" would be a better word to describe them). Also note that this only works for reads that are multiples of the disk's sector size, so in a typical case you need to open the file twice, once with FILE_FLAG_NO_BUFFERING to read the bulk of the data, then again without that flag to read the "tail" of the file.

Although it only copies a file (rather than processing the contents), and it's plain C rather than C++, perhaps some demo code will be of at least a little help:

#include <windows.h>
#include <stdio.h>

int do_copy(char const *in, char const *out) {

    HANDLE infile;
    HANDLE outfile;
    char *buffer;
    DWORD read, written;
    LONG junk = 0;
    unsigned long little_tail;
    unsigned long big_tail;
    unsigned __int64 total_copied = 0;
    unsigned __int64 total_size = 0;
    BY_HANDLE_FILE_INFORMATION file_info;

#define size (1024 * 8192)

    /* FILE_FLAG_NO_BUFFERING requires a sector-aligned buffer;
       VirtualAlloc returns page-aligned memory, which satisfies that. */
    buffer = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
    if (NULL == buffer)
        return 0;

    infile = CreateFile(in,
        GENERIC_READ,
        FILE_SHARE_READ,
        NULL,
        OPEN_EXISTING,              /* don't create the input if it's missing */
        FILE_FLAG_NO_BUFFERING,
        NULL);

    outfile = CreateFile(out,
        GENERIC_WRITE,
        FILE_SHARE_READ,
        NULL,
        CREATE_ALWAYS,
        FILE_FLAG_NO_BUFFERING,
        NULL);

    if ((infile == INVALID_HANDLE_VALUE) || (outfile == INVALID_HANDLE_VALUE)) {
        VirtualFree(buffer, 0, MEM_RELEASE);
        return 0;
    }

    GetFileInformationByHandle(infile, &file_info);
    /* store size/100 so total_copied / total_size yields a percentage */
    total_size = (((unsigned __int64)file_info.nFileSizeHigh << 32) |
                  (unsigned __int64)file_info.nFileSizeLow) / 100;

    while (ReadFile(infile, buffer, size, &read, NULL) && read == size) {
        WriteFile(outfile, buffer, read, &written, NULL);
        total_copied += written;
        fprintf(stderr, "\rcopied: %lu %%", (unsigned long)(total_copied / total_size));
    }

    /* unbuffered writes must be whole sectors; a 4096-byte sector is assumed */
    little_tail = read % 4096;
    big_tail = read - little_tail;

    WriteFile(outfile, buffer, big_tail, &written, NULL);

    CloseHandle(infile);
    CloseHandle(outfile);

    /* reopen without FILE_FLAG_NO_BUFFERING to append the sub-sector tail */
    outfile = CreateFile(out,
        GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN,
        NULL);
    fprintf(stderr, "\rcopied: 100 %%\n");

    SetFilePointer(outfile, 0, &junk, FILE_END);
    WriteFile(outfile, buffer + big_tail, little_tail, &written, NULL);
    CloseHandle(outfile);
    VirtualFree(buffer, 0, MEM_RELEASE);   /* dwSize must be 0 with MEM_RELEASE */
    return 1;
}
Jerry Coffin
  • isn't reading a big chunk of data inside a VirtualAlloc'd buffer the same as using memory mapped files? – Simone Margaritelli Dec 05 '13 at 13:41
  • Actually you don't need to (re-)open the file to read the 'tail'. The ReadFile call will just correctly return the number of bytes available read. – Lawrence Kok Dec 05 '13 at 14:30
  • @SimoneMargaritelli: Conceptually, they're pretty much the same. At least in my experience, the actual effect (especially the performance) is quite a bit different though. – Jerry Coffin Dec 05 '13 at 15:34
1

Before parallelizing anything, always ask yourself first: does the added complexity justify the performance gained? To answer that question with minimal effort, just measure what percentage of the maximum read throughput you already achieve. That is, test your current read throughput and then test the maximum throughput; don't use the theoretical maximum for this computation. Then think about how much complexity, and how many possible issues, even the simplest approach would introduce to gain the last few percent.

As already mentioned in the comments, the greatest performance gain here is probably achieved by pipelining (i.e. overlapping computation and I/O). And the easiest way to implement that is with asynchronous reads. This thread lists multiple ways to implement asynchronous file I/O in C++.

If you don't need portability, just use the Windows OVERLAPPED API. Boost ASIO does not seem to make file I/O very easy (yet); I could not find any good examples.

Note that, depending on your system configuration, you may have to launch multiple threads to fully saturate I/O bandwidth (especially if the files in that folder actually reside on multiple disks, which is possible). Even if you only read from one device, you might fare (slightly) better with multiple threads to mitigate OS overhead.
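The pipelining idea can be sketched portably. The following is a minimal example under stated assumptions: a toy FNV-1a hash stands in for the real cryptographic hash, and an unbounded queue is used for brevity (a bounded queue would cap memory on very large files). One thread reads chunks into the queue while the calling thread drains and hashes them, so computation overlaps I/O.

```cpp
#include <condition_variable>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Toy FNV-1a, standing in for the real cryptographic hash.
static void fnv1a(uint64_t& h, const char* data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
}

// Pipelined hash: a reader thread fills a queue with file chunks while
// the caller's thread drains it and hashes, overlapping I/O with CPU work.
uint64_t pipelined_hash(const std::string& path, size_t chunk = 1 << 20) {
    std::queue<std::vector<char>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread reader([&] {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(chunk);
        // read() fails on the final partial chunk; gcount() still reports it
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
            std::vector<char> part(buf.begin(), buf.begin() + in.gcount());
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(part));
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_one();
    });

    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return done || !q.empty(); });
        if (q.empty() && done) break;
        std::vector<char> part = std::move(q.front());
        q.pop();
        lk.unlock();  // hash outside the lock so the reader can keep filling
        fnv1a(h, part.data(), part.size());
    }
    reader.join();
    return h;
}
```

Chunking doesn't change the byte stream fed to the hash, so the result matches a single-threaded pass over the whole file.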

Domi
1

Be wary of multi-threading I/O-bound scenarios. Here is a case where it makes things only slightly faster on an SSD but a lot slower on an HDD. The following results are from a test reading one big file per thread on Windows 10:

SSD:

  • 1 thread 200 MB/s
  • 2 threads 100 MB/s

HDD:

  • 1 thread 100 MB/s
  • 2 threads 10 MB/s
gast128
1

Why does everyone appear so sure that IOCPs are out of the question? With IOCPs you can basically initiate every read you want to do, and the completions will arrive as they finish, which more often than not will not be in the order they were issued. Then you do your crypto work and store the hash, or whatever. In the meantime one or more further IOCP reads will have completed and you can do your crypto work on them as well.

Is it not even worth a shot?
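A minimal single-threaded sketch of the approach this answer describes, with stated assumptions: `hash_update` is a placeholder stub standing in for a real incremental hash, one read is kept outstanding per file, and error handling is minimal. This is Windows-only illustration code, not a drop-in implementation.

```cpp
#include <windows.h>
#include <string>
#include <vector>

struct FileCtx {
    OVERLAPPED ov;          // per-file overlapped state
    HANDLE file;
    std::vector<char> buf;
    ULONGLONG offset;
};

// Placeholder: feed buf[0..bytes) into your incremental hash here.
static void hash_update(FileCtx& ctx, DWORD bytes) { (void)ctx; (void)bytes; }

// Queue the next chunk; returns false on immediate failure (e.g. EOF on an
// empty file), in which case no completion packet will arrive.
static bool issue_read(FileCtx* ctx) {
    ZeroMemory(&ctx->ov, sizeof(ctx->ov));
    ctx->ov.Offset     = (DWORD)ctx->offset;
    ctx->ov.OffsetHigh = (DWORD)(ctx->offset >> 32);
    if (!ReadFile(ctx->file, ctx->buf.data(), (DWORD)ctx->buf.size(), NULL, &ctx->ov)
        && GetLastError() != ERROR_IO_PENDING)
        return false;
    return true;
}

void hash_files(const std::vector<std::string>& paths) {
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    DWORD pending = 0;

    for (const std::string& p : paths) {
        FileCtx* ctx = new FileCtx();
        ctx->file = CreateFileA(p.c_str(), GENERIC_READ, FILE_SHARE_READ, NULL,
                                OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (ctx->file == INVALID_HANDLE_VALUE) { delete ctx; continue; }
        CreateIoCompletionPort(ctx->file, port, (ULONG_PTR)ctx, 0);  // ctx is the key
        ctx->buf.resize(1 << 20);
        ctx->offset = 0;
        if (!issue_read(ctx)) { CloseHandle(ctx->file); delete ctx; continue; }
        ++pending;
    }

    while (pending) {
        DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        if (ov == NULL) continue;            // GQCS itself failed
        FileCtx* ctx = (FileCtx*)key;
        if (!ok || bytes == 0) {             // error or end of file
            CloseHandle(ctx->file);          // finalize this file's hash here
            delete ctx;
            --pending;
            continue;
        }
        hash_update(*ctx, bytes);            // CPU work between completions
        ctx->offset += bytes;
        if (!issue_read(ctx)) { CloseHandle(ctx->file); delete ctx; --pending; }
    }
    CloseHandle(port);
}
```

The single thread never blocks on any one file: while it hashes one completed chunk, reads for the other files are in flight.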

Olof Forshell
0

I can tell you how I successfully solved a very similar problem: JPEG optimisation.

Let's say we have four threads. Each thread is given the path and a thread number from 0 to 3. Each thread needs to know the file number, so it needs a global variable like int gFileNumber[4].

The threads use FindFirstFile/FindNextFile, and whenever one finds a jpg (in my case), its file number is incremented: gFileNumber[threadnumber]++.

Now comes the part I think is clever: if (threadnumber == gFileNumber[threadnumber] % 4), then this thread should deal with this file and just calculates the hash. The threads will calculate the hashes for different files at the same time.
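The distribution scheme can be sketched portably under stated assumptions: a precomputed file list stands in for the FindFirstFile/FindNextFile walk, and recording which thread owns each file stands in for hashing it. All names here are illustrative.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each thread walks the same file list; file i is handled by exactly one
// thread, the one whose id equals i % kThreads, so the threads never
// contend over the same file and need no locking around the work itself.
std::vector<int> owners(const std::vector<std::string>& files, int kThreads = 4) {
    std::vector<int> owner(files.size(), -1);
    std::vector<std::thread> pool;
    for (int id = 0; id < kThreads; ++id) {
        pool.emplace_back([&, id] {
            for (size_t i = 0; i < files.size(); ++i)
                if ((int)(i % kThreads) == id)
                    owner[i] = id;   // stand-in for "hash this file"
        });
    }
    for (auto& t : pool) t.join();
    return owner;
}
```

Each thread writes only its own slots, which is why no critical section is needed for the work; shared progress counters, as the comment below notes, would still need one.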

  • The idea is that each thread determines whether it is responsible for a given file. You could use the file size instead, because if you calculate filesize mod 4, whatever you get will be equal to the thread number for only one of the threads. If you want a progress bar and the ability to calculate an ETA, you first need a thread that counts the files. It could also sum the file sizes. You need additional global variables, protected by critical sections. You then know how many files you have processed and how many kilobytes, and you can use an average to estimate the time remaining. – Fredrik Wahlgren Mar 23 '23 at 13:22