
I have an application that sequentially reads data from a file. Some of it is read directly via a pointer into the mmapped file, and other parts are memcpyed from the file into another buffer. I noticed poor performance when doing a large memcpy of all the memory that I needed (1MB blocks) and better performance when doing many smaller memcpy calls (in my tests I used 4KB, the page size, which took one third of the time to run). I believe the issue is a very large number of major page faults when using a large memcpy.

I've tried various tuning parameters (MAP_POPULATE, MADV_WILLNEED, MADV_SEQUENTIAL) without any noticeable improvement.

I'm not sure why many small memcpy calls should be faster; it seems counter-intuitive. Is there any way to improve this?

Results and test code follow.

Running on CentOS 7 (Linux 3.10.0), default compiler (gcc 4.8.5), reading a 29GB file from a RAID array of regular disks.

Running with /usr/bin/time -v:

4KB memcpy:

User time (seconds): 5.43
System time (seconds): 10.18
Percent of CPU this job got: 75%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.59
Major (requiring I/O) page faults: 4607
Minor (reclaiming a frame) page faults: 7603470
Voluntary context switches: 61840
Involuntary context switches: 59

1MB memcpy:

User time (seconds): 6.75
System time (seconds): 8.39
Percent of CPU this job got: 23%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03.71
Major (requiring I/O) page faults: 302965
Minor (reclaiming a frame) page faults: 7305366
Voluntary context switches: 302975
Involuntary context switches: 96

MADV_WILLNEED did not seem to have much impact on the 1MB copy result.

MADV_SEQUENTIAL slowed the 1MB copy down so much that I didn't wait for it to finish (at least 7 minutes).

MAP_POPULATE slowed the 1MB copy result by about 15 seconds.

Simplified code used for the test:

#include <algorithm>
#include <iostream>
#include <stdexcept>

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
  try {
    char *filename = argv[1];

    int fd = open(filename, O_RDONLY);
    if (fd == -1) {
      throw std::runtime_error("Failed open()");
    }

    off_t file_length = lseek(fd, 0, SEEK_END);
    if (file_length == (off_t)-1) {
      throw std::runtime_error("Failed lseek()");
    }

    int mmap_flags = MAP_PRIVATE;
#ifdef WITH_MAP_POPULATE
    mmap_flags |= MAP_POPULATE;  // Small performance degradation if enabled
#endif

    void *map = mmap(NULL, file_length, PROT_READ, mmap_flags, fd, 0);
    if (map == MAP_FAILED) {
      throw std::runtime_error("Failed mmap()");
    }

#ifdef WITH_MADV_WILLNEED
    madvise(map, file_length, MADV_WILLNEED);    // No difference in performance if enabled
#endif

#ifdef WITH_MADV_SEQUENTIAL
    madvise(map, file_length, MADV_SEQUENTIAL);  // Massive performance degradation if enabled
#endif

    const uint8_t *file_map_i = static_cast<const uint8_t *>(map);
    const uint8_t *file_map_end = file_map_i + file_length;

    size_t memcpy_size = MEMCPY_SIZE;  // defined at compile time, e.g. -DMEMCPY_SIZE=4096

    uint8_t *buffer = new uint8_t[memcpy_size];

    while (file_map_i != file_map_end) {
      size_t this_memcpy_size = std::min(memcpy_size, static_cast<std::size_t>(file_map_end - file_map_i));
      memcpy(buffer, file_map_i, this_memcpy_size);
      file_map_i += this_memcpy_size;
    }
  }
  catch (const std::exception &e) {
    std::cerr << "Caught exception: " << e.what() << std::endl;
  }

  return 0;
}
Alex
  • @AjayBrahmakshatriya Are you saying that you're seeing different numbers on your system? – that other guy Oct 16 '18 at 23:53
  • did you try something like `perf stat` ? – Severin Pappadeux Oct 16 '18 at 23:54
  • @AjayBrahmakshatriya The number of page faults in each execution is given in the post. It's 4607 vs 302965. – that other guy Oct 16 '18 at 23:58
  • Are you accounting for file caching in these tests? – that other guy Oct 17 '18 at 00:01
  • Interesting, the large memcpy took less combined (user+system) CPU time, but had a ton more context switches, presumably being responsible for it being so much slower on the wall clock? – xaxxon Oct 17 '18 at 00:02
  • Is it possible that the smaller memcpy was able to trigger multiple forward-looking disk reads and get the number of pending reads high enough that the disk was more saturated? Do you have any disk command queue and read throughput stats for the disk while it's running? Also, what RAID config is the disk in? Not sure it matters, but I'm curious.. 0? 5? 50? etc – xaxxon Oct 17 '18 at 00:04
  • Also, when I build this with optimizations turned on, the memcpy is optimized out completely since it results in no observable changes -- https://godbolt.org/z/zGdG1w what compiler and flags are you using to build this? adding something silly after the memcpy puts it back in, though: volatile uint8_t c = buffer[10]; https://godbolt.org/z/ctTRPV – xaxxon Oct 17 '18 at 00:25
  • @xaxxon I'm just using gcc 4.8.5 with -O3 plus any required defines (e.g. MEMCPY_SIZE). If I change to that configuration on Compiler Explorer, it seems to still include the memcpy with no changes. The system is using hardware RAID (MegaRaid SAS) in a RAID-6 configuration. atop shows an average queue depth of ~1 and ~0.14ms per request for the 1MB copy and queue depth of ~1.3 and ~0.07ms per request for the 4KB copy. – Alex Oct 17 '18 at 12:55
  • @thatotherguy The tests were run multiple times and while the figures weren't exactly the same each time, they were always similar with a substantial difference in wall clock time and major page faults. – Alex Oct 17 '18 at 12:58
  • @SeverinPappadeux `perf stat` showed page-faults to be the same on the 2 different memcpy sizes (it seems to be the same as "minor page faults" in `/usr/bin/time -v`), but shows context switches to be ~4 times as many with the 1MB copy. – Alex Oct 17 '18 at 18:32
  • @Alex basically it looks like problem with allocating 1MB chunk. It is allocated as anonymous or /dev/zero mmap, so it is causing a lot more pagefaults. It is useless in this situation because source is mmaped already. You could try to check this theory either by disabling mmap in malloc (https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html), or running strace to get syscalls. – Severin Pappadeux Oct 17 '18 at 19:07
  • @SeverinPappadeux Thanks for the idea, but this change doesn't seem to have made a noticeable difference to the performance. I verified with strace that it was currently using mmap for the 1MB allocation (it was, as you expected), then ran with the environment variable MALLOC_MMAP_MAX_=0 and confirmed it was no longer using mmap for the 1MB allocation (it wasn't, there were "brk" system calls instead). It didn't improve the performance or the number of major page faults though, unfortunately. – Alex Oct 18 '18 at 14:27
  • @Alex Interesting, very interesting. Let me think about it a bit more – Severin Pappadeux Oct 18 '18 at 14:54
  • memcpy uses different algorithm depending on the size. A larger sized copy will avoid using the L2/L3 cache. Maybe that's the cause of the difference. Btw., if you memcpy a large portion, you should not use mmap. Use `read` instead. – geza Nov 04 '18 at 23:16
  • @Alex - (a) how much physical memory is there on the machine? (b) how much swap space?, and (c) any difference in performance if you add `MAP_NORESERVE` to `mmap_flags`? – cegfault Nov 05 '18 at 22:23
  • @geza It might be the issue, I don't really know. This example is a lot simpler than the real code, it just exhibits the same issue. Using mmap makes it simpler to handle the other logic. I have also tried a version with vectored IO instead of mmap (the real application handles a "header" structure and "payload" for each block, and the payload has to be copied to a specific location while the header just needs to be read by the application), and it performs similarly to the 4KB mmap copy version, while making the code a lot less readable due to handling of partial reads with vectored I/O. – Alex Nov 06 '18 at 10:45
  • @cegfault (a) 64GB of memory -- nothing much is running while I'm doing these tests. (b) It's never gone in to swap. (c) Thanks for the suggestion, but that didn't make a difference. I'm surprised that most of the flags seem to make things worse -- particularly MADV_SEQUENTIAL which made it crawl, even though this is very sequential. I'd read using MADV_SEQUENTIAL and MAP_POPULATE together would hint to read ahead, which I thought would help. – Alex Nov 06 '18 at 11:04
  • This is a really interesting problem; the performance for me between 1m and 4k is negligible until I hit about 8GB, which is the amount of ram on my current machine. It appears mmap works differently (or, more accurately, vm_allocate in the kernel) on huge sizes, and differently again on sizes larger than physical memory. Since your 29GB file is smaller than 64GB ram, we can rule out the kernel's handling of values larger than memory. That said, there is another cutoff somewhere where performance magically drops, and I don't know if that's machine independent. – cegfault Nov 06 '18 at 17:33
  • It might be helpful if you ran this code against file sizes from 1GB on up to 30GB so we can see if/where there's a change in performance. – cegfault Nov 06 '18 at 17:34
  • The memcpy implementation used in CentOS 7 seems to use a non-linear memory access pattern, which is non-optimal for mmapped files. Could you try a simple handwritten implementation? For benchmarking, use a separate cpp file to avoid inlining. – crash Nov 08 '18 at 08:59
  • Can you put here the disassembly of memcpy? Just to confirm my (and crash's) theory. – geza Nov 08 '18 at 14:20

1 Answer

If the underlying file and disk systems aren't fast enough, whether you use mmap() or POSIX open()/read() or standard C fopen()/fread() or C++ iostreams won't matter much at all.

If performance really matters and the underlying file and disk system(s) are fast enough, though, mmap() is probably the worst possible way to read a file sequentially. Creating the mapped pages is a relatively expensive operation, and since each byte of data is read only once, the cost per access can be extreme. Using mmap() can also increase memory pressure on your system. You can explicitly munmap() pages after you read them, but then your processing can stall while the mappings are torn down.
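A windowed variant of that idea can be sketched as follows (the function name, the window size, and the byte-checksum used to keep the reads observable are all illustrative, not from the question's code); each window is consumed and then unmapped before the next one is created, so stale mappings never accumulate:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch: read a file sequentially by mapping it one fixed-size window
// at a time, unmapping each window after it has been consumed.
// window_size must be a multiple of the page size, because mmap()
// offsets must be page-aligned. Returns a byte checksum so the reads
// can't be optimized away.
uint64_t sum_file_windowed(const char *path, size_t window_size)
{
    int fd = ::open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    off_t file_length = ::lseek(fd, 0, SEEK_END);

    uint64_t checksum = 0;
    for (off_t offset = 0; offset < file_length; ) {
        size_t this_window =
            std::min<off_t>(window_size, file_length - offset);
        void *map = ::mmap(NULL, this_window, PROT_READ, MAP_PRIVATE,
                           fd, offset);
        if (map == MAP_FAILED)
            break;

        const uint8_t *p = static_cast<const uint8_t *>(map);
        for (size_t i = 0; i < this_window; ++i)
            checksum += p[i];

        ::munmap(map, this_window);  // tear down before mapping the next window
        offset += this_window;
    }

    ::close(fd);
    return checksum;
}
```

Whether this wins depends on the window size: too small and you pay mmap()/munmap() syscall overhead per window, too large and you're back to the original mapping pressure.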

Using direct IO will probably be the fastest, especially for large files as there's not a massive number of page faults involved. Direct IO bypasses the page cache, which is a good thing for data read only once. Caching data read only once - never to be reread - is not only useless but potentially counterproductive as CPU cycles get used to evict useful data from the page cache.

Example (error checking omitted for clarity):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main( int argc, char **argv )
{
    // vary this to find the optimal size
    // (must be a multiple of the page size)
    size_t copy_size = 1024UL * 1024UL;

    // get a page-aligned buffer
    void *buffer;
    ::posix_memalign( &buffer, ( size_t ) ( 4UL * 1024UL ), copy_size );

    // make sure the entire buffer's virtual-to-physical mappings
    // are actually done (can actually matter with large buffers and
    // extremely fast IO systems)
    ::memset( buffer, 0, copy_size );

    int fd = ::open( argv[ 1 ], O_RDONLY | O_DIRECT );

    for ( ;; )
    {
        ssize_t bytes_read = ::read( fd, buffer, copy_size );
        if ( bytes_read <= 0 )
        {
            break;
        }
    }

    return( 0 );
}

Some caveats exist when using direct IO on Linux. File system support can be spotty, and implementations of direct IO can be finicky. You probably have to use a page-aligned buffer to read data into, and you may not be able to read the very last page of the file if it's not a full page.
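One common way around both caveats (a sketch, not from the answer above; the function name and the 4096-byte block size are illustrative) is to read the block-aligned body of the file through an O_DIRECT descriptor and pick up the unaligned tail through a second, ordinary descriptor. The same fallback path also covers filesystems where opening with O_DIRECT fails outright:

```cpp
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Sketch: read a whole file into dst, using O_DIRECT for as many whole
// blocks as possible and a plain buffered descriptor for the remainder.
// block_size is assumed to satisfy the device's alignment requirement
// (512 or 4096 are typical). Returns bytes read, or -1 on error.
ssize_t read_with_direct_body(const char *path, uint8_t *dst,
                              size_t dst_len, size_t block_size)
{
    size_t total = 0;

    int dfd = ::open(path, O_RDONLY | O_DIRECT);
    if (dfd != -1) {
        void *bounce = NULL;  // O_DIRECT needs an aligned buffer
        if (::posix_memalign(&bounce, block_size, block_size) == 0) {
            // Consume only full, aligned blocks via direct IO.
            while (total + block_size <= dst_len) {
                ssize_t n = ::read(dfd, bounce, block_size);
                if (n != (ssize_t)block_size)
                    break;  // EOF or short tail: finish with buffered IO
                memcpy(dst + total, bounce, (size_t)n);
                total += (size_t)n;
            }
            ::free(bounce);
        }
        ::close(dfd);
    }

    // Ordinary buffered descriptor for whatever is left (the tail, or
    // the whole file if O_DIRECT wasn't supported on this filesystem).
    int fd = ::open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    while (total < dst_len) {
        ssize_t n = ::pread(fd, dst + total, dst_len - total, (off_t)total);
        if (n <= 0)
            break;
        total += (size_t)n;
    }
    ::close(fd);
    return (ssize_t)total;
}
```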

Andrew Henle