On POSIX systems, I can use the mmap function to read the contents of a file faster than with getline, getc, etc. This matters in the program I am developing because it is expected to read very large files into memory, and collecting lines iteratively with getline is too costly. Portability is also a requirement, so if I use mmap I need a way to memory-map files using the WinApi, as I'd rather not compile through Cygwin/MSYS. From a cursory search I identified this MSDN article, which describes very briefly a way to map files into memory. However, after trawling through the documentation I can't make head or tail of how to actually implement it, and I can't find example snippets of code like the ones that exist for POSIX mmap.

How do I use the WinApi's memory mapping options to read a file into a char*?

  • [Creating Named Shared Memory](https://learn.microsoft.com/en-us/windows/win32/memory/creating-named-shared-memory). – IInspectable Jul 13 '21 at 19:32
  • Here's a link from SO with code: https://stackoverflow.com/questions/22047673/transferring-data-through-a-memory-mapped-file-using-win32-winapi It has a link to an MSDN article, but here's another [top level] link: https://learn.microsoft.com/en-us/windows/win32/memory/file-mapping – Craig Estey Jul 13 '21 at 20:03
  • There's probably a [thin] wrapper library that implements `mmap` in terms of the win32 API. Maybe: https://learn.microsoft.com/en-us/cpp/c-runtime-library/run-time-routines-by-category?view=msvc-160 What I'd do is code assuming that `mmap` is available. If you can't find a `mmap` wrapper function, write one of your own, using the above win32 API calls – Craig Estey Jul 13 '21 at 20:08
  • I am yet to see a proof that sequential reads from memory mapped files are faster than sequential reads from buffered streams like C IO (absent special techniques like prefetch). Most likely you are not comparing apples to apples. – SergeyA Jul 13 '21 at 20:14
  • @SergeyA `mmap` is _absolutely_ faster. See my answer: https://stackoverflow.com/questions/33616284/read-line-by-line-in-the-most-efficient-way-platform-specific/33620968#33620968 – Craig Estey Jul 13 '21 at 20:16
  • @CraigEstey like I said, this is not comparing apples to apples. The first snippet uses string routines from CRT, while the other example hand-rolls string parsing in buffer. Show me an example when one first `fread`s into buffer of 4k and then parses it - vs mmaping and parsing - and then it will be apples to apples. – SergeyA Jul 13 '21 at 20:44
  • This doesn't add much to the debate of whether mmap is faster or not, but speed isn't the only reason I want to use mmaping. I also just want to have several fallback methods for reading a file into memory, so that the eventual program is more robust. A lot of the time when I've seen programs that solely use getline or fgetc, trying to use the program with a sufficiently large text file (sometimes as small as a couple mb) is enough for the program to segfault. – Bithov Vinu Jul 13 '21 at 21:02
  • @SergeyA I believe `fread` suffers from the same problem as `read` [as mentioned in the 1st paragraph]: For _line_ oriented input, an `fread/read` may chop the buffer in the middle of the line. If you believe `fread` is faster, feel free to hack up my benchmark [from the pastebin link in the comments]. FYI, the `mmap` thing is the answer to an interview question at FB re. performance with multithreaded parsing of large (>1GB) files. – Craig Estey Jul 13 '21 at 22:14
  • @CraigEstey First, `fread` is buffered and it will be faster if you read less than C buffer size at a time. Otherwise, it will have the performance comparable with `read`. Second, yes, both reads can (and will!) read in between the lines. It is up to you, the developer, to work around it - as there are ways. I never said you can write *naive* code with reads. Third, you mentioning facebook interview question is a non-sequitur. – SergeyA Jul 13 '21 at 22:19
  • @BithovVinu I'm not sure what your current issue is. A well written [bug free] program won't segfault regardless of the access method (e.g. `fgetc`, `fgets`, `*read`, or `mmap`), unless you're trying to `malloc` all the tokens and run out of memory. So, the links give you the tools to implement an `mmap` wrapper (e.g. `osfree_mmap`), tailored to your needs. You may need to create a control struct that hides `int` vs `HANDLE`, etc. and holds the OS specific data. Perhaps you could _edit_ your question and post your use case. – Craig Estey Jul 13 '21 at 22:34
  • Regarding `mmap` vs `read`: Both need to do the very same thing, open the file and copy its contents into memory. As long as all the contents is accessed after `mmap`, I'd say both methods need to be similarly fast. – the busybee Jul 14 '21 at 06:23
  • *I am able to use the mmap function to read the contents of a file faster than getline, getc, etc* If speed is that important, how did you ***measure*** that difference? Did you try changing buffer size using `setvbuf()`? Did you try using `read()` with `O_DIRECT`? There's nothing wrong with `mmap()`, but the way you're reading the file (once, sequentially) is probably one of the ***slowest*** ways to read a file. `mmap()` has to do a lot of single-threaded virtual memory operations, and then the data has to be page-faulted in. That's all S-L-O-W. `mmap()` is ***not*** really fast. – Andrew Henle Jul 14 '21 at 07:57
  • @thebusybee *Regarding mmap vs read: Both need to do the very same thing* Except that `mmap()` also has to do a lot of virtual memory operations, and those aren't fast. `read()` doesn't have to do any of that. – Andrew Henle Jul 14 '21 at 10:38
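
The buffered-read baseline discussed in the comments (read a fixed-size chunk with `fread`, then parse it, carrying state across chunk boundaries) might look like the following minimal sketch. It is not taken from any of the linked answers or benchmarks. The function name `longest_line` and its `chunk_size` parameter are hypothetical, chosen only for illustration. The point is that a line split across two chunks is handled by carrying a running length, not by copying.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the length (excluding '\n') of the longest
 * line in `fp`, reading at most `chunk_size` bytes per fread call.
 * A chunk_size of 0, or one larger than the buffer, means "use 4096". */
size_t longest_line(FILE *fp, size_t chunk_size)
{
    char buf[4096];
    size_t longest = 0;
    size_t current = 0;   /* length of the line seen so far; survives chunk boundaries */
    size_t n;

    assert(fp != NULL);
    if (chunk_size == 0 || chunk_size > sizeof buf)
        chunk_size = sizeof buf;

    while ((n = fread(buf, 1, chunk_size, fp)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        while (p < end) {
            const char *nl = memchr(p, '\n', (size_t)(end - p));
            if (nl == NULL) {
                /* The line continues in the next chunk: remember its length so far. */
                current += (size_t)(end - p);
                break;
            }
            current += (size_t)(nl - p);
            if (current > longest)
                longest = current;
            current = 0;
            p = nl + 1;
        }
    }

    /* A final line without a trailing newline still counts. */
    if (current > longest)
        longest = current;
    return longest;
}
```

Running the same parsing loop over a memory-mapped region and over this chunked reader would give the like-for-like comparison the comments ask for; which one wins has to be measured on the target system.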

1 Answer

> How do I use the WinApi's memory mapping options to read a file into a char*?

Under Windows, when you map a file in memory, you get a pointer to the memory location where the first byte of the file has been mapped. You can cast that pointer to whatever datatype you like, including char*.

In other words, it is Windows that decides where the mapped data will be in memory. You cannot provide a char* and expect Windows to load the data there.

This means that if you already have a char* and want the data from the file at the location pointed to by that char*, then you have to copy it. That is not a good idea in terms of performance.
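
If a separate char* really is required, for example because the data must be passed to functions that expect a NUL-terminated string (a mapped view is raw bytes with a length and is not guaranteed to end in a terminator), the copy could look like this minimal sketch. The helper name `dup_as_cstring` is hypothetical and not part of the Windows API or the C library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes from a mapped view (or any buffer)
 * into a freshly allocated, NUL-terminated string.
 * Returns NULL on allocation failure; the caller frees the result. */
char *dup_as_cstring(const void *view, size_t len)
{
    char *copy;

    assert(view != NULL || len == 0);
    if (len == SIZE_MAX)            /* len + 1 would overflow */
        return NULL;

    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    if (len > 0)
        memcpy(copy, view, len);    /* this is the extra cost the copy incurs */
    copy[len] = '\0';
    return copy;
}
```

For very large files this doubles the memory footprint, so working directly on the mapped pointer with an explicit length is usually preferable.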

Here is a simple program dumping a text file by mapping the file into memory and then displaying all ASCII characters. Tested with MSVC2019.

```c
#include <stdio.h>
#include <Windows.h>

int main(int argc, char *argv[])
{
    LPCTSTR lpFileName = TEXT("hello.txt");
    HANDLE hFile;
    HANDLE hMap;
    LPVOID lpBasePtr;
    LARGE_INTEGER liFileSize;

    hFile = CreateFile(lpFileName,
        GENERIC_READ,                          // dwDesiredAccess
        0,                                     // dwShareMode (0 = no sharing while open)
        NULL,                                  // lpSecurityAttributes
        OPEN_EXISTING,                         // dwCreationDisposition
        FILE_ATTRIBUTE_NORMAL,                 // dwFlagsAndAttributes
        NULL);                                 // hTemplateFile
    if (hFile == INVALID_HANDLE_VALUE) {
        // GetLastError() returns a DWORD (unsigned long), hence %lu
        fprintf(stderr, "CreateFile failed with error %lu\n", GetLastError());
        return 1;
    }

    if (!GetFileSizeEx(hFile, &liFileSize)) {
        fprintf(stderr, "GetFileSizeEx failed with error %lu\n", GetLastError());
        CloseHandle(hFile);
        return 1;
    }

    // A zero-length file cannot be mapped
    if (liFileSize.QuadPart == 0) {
        fprintf(stderr, "File is empty\n");
        CloseHandle(hFile);
        return 1;
    }

    hMap = CreateFileMapping(
        hFile,
        NULL,                          // Mapping attributes
        PAGE_READONLY,                 // Protection flags
        0,                             // MaximumSizeHigh (0, 0 = current file size)
        0,                             // MaximumSizeLow
        NULL);                         // Name
    if (hMap == NULL) {
        fprintf(stderr, "CreateFileMapping failed with error %lu\n", GetLastError());
        CloseHandle(hFile);
        return 1;
    }

    lpBasePtr = MapViewOfFile(
        hMap,
        FILE_MAP_READ,         // dwDesiredAccess
        0,                     // dwFileOffsetHigh
        0,                     // dwFileOffsetLow
        0);                    // dwNumberOfBytesToMap (0 = to the end of the mapping)
    if (lpBasePtr == NULL) {
        fprintf(stderr, "MapViewOfFile failed with error %lu\n", GetLastError());
        CloseHandle(hMap);
        CloseHandle(hFile);
        return 1;
    }

    // Display file content as ASCII characters.
    // The view is not a C string: iterate by file size, not until '\0'.
    char *ptr = (char *)lpBasePtr;
    LONGLONG i = liFileSize.QuadPart;
    while (i-- > 0) {
        fputc(*ptr++, stdout);
    }

    UnmapViewOfFile(lpBasePtr);
    CloseHandle(hMap);
    CloseHandle(hFile);

    printf("\nDone\n");
    return 0;
}
```
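
Since the question asks for portability, the same calls can be hidden behind a small wrapper with a POSIX fallback, along the lines of the control struct suggested in the comments. The following is only a sketch under stated assumptions: the names `mapped_file`, `mapped_file_open` and `mapped_file_close` are hypothetical, the path is taken as a narrow `char` string (so the ANSI `CreateFileA`/`CreateFileMappingA` variants are used on Windows), and the whole file is mapped read-only in one view.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Hypothetical control struct: hides HANDLE vs. int behind one type. */
typedef struct mapped_file {
    const char *data;   /* first mapped byte; NOT NUL-terminated */
    size_t      size;   /* number of mapped bytes (the file size) */
#ifdef _WIN32
    HANDLE      hFile;
    HANDLE      hMap;
#endif
} mapped_file;

/* Map `path` read-only. Returns 0 on success, -1 on failure.
 * An empty file is reported as a failure because it cannot be mapped. */
int mapped_file_open(mapped_file *mf, const char *path)
{
    assert(mf != NULL && path != NULL);
    mf->data = NULL;
    mf->size = 0;
#ifdef _WIN32
    LARGE_INTEGER sz;
    mf->hMap = NULL;
    mf->hFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (mf->hFile == INVALID_HANDLE_VALUE)
        return -1;
    if (!GetFileSizeEx(mf->hFile, &sz) || sz.QuadPart <= 0 ||
        (ULONGLONG)sz.QuadPart > (size_t)-1)   /* too big for this address space */
        goto fail;
    mf->hMap = CreateFileMappingA(mf->hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mf->hMap == NULL)
        goto fail;
    mf->data = (const char *)MapViewOfFile(mf->hMap, FILE_MAP_READ, 0, 0, 0);
    if (mf->data == NULL)
        goto fail;
    mf->size = (size_t)sz.QuadPart;
    return 0;
fail:
    if (mf->hMap != NULL)
        CloseHandle(mf->hMap);
    CloseHandle(mf->hFile);
    return -1;
#else
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    mf->data = (const char *)p;
    mf->size = (size_t)st.st_size;
    return 0;
#endif
}

/* Release the view and any OS handles; safe to call on a failed open. */
void mapped_file_close(mapped_file *mf)
{
    if (mf == NULL || mf->data == NULL)
        return;
#ifdef _WIN32
    UnmapViewOfFile(mf->data);
    CloseHandle(mf->hMap);
    CloseHandle(mf->hFile);
#else
    munmap((void *)mf->data, mf->size);
#endif
    mf->data = NULL;
    mf->size = 0;
}
```

Callers then work only with `mf.data` and `mf.size` on both platforms, for example `mapped_file mf; if (mapped_file_open(&mf, "hello.txt") == 0) { fwrite(mf.data, 1, mf.size, stdout); mapped_file_close(&mf); }`. Files with non-ASCII names on Windows would need a wide-character variant, which this sketch leaves out.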
fpiette