
I wrote a small test to compare file-writing performance between boost::interprocess::file_mapping and std::ofstream. I was under the impression that file_mapping would be faster, but apparently that is not the case.

Can someone explain why I would get better numbers with std::ofstream?

[EDIT]: I profiled my benchmark and noticed that boost::iostreams::detail::direct_streambuf was spending a lot of time copying bytes. I've added a new test that uses std::copy_n instead of ostream.write; the performance is much better now. I have also updated the test code to compare different file sizes.

The boost iostreams direct_streambuf really struggles at high volumes compared to std::copy_n. I'd like to find a better alternative, as my app is based on ostream and I can't afford the refactoring.

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>
#include <algorithm> // std::copy_n
#include <chrono>
#include <cstdint>   // uint64_t
#include <fstream>
#include <iostream>
#include <memory>    // std::shared_ptr
#include <string>
#include <vector>

int test_mapped_file_ostream(size_t TOTAL_SIZE, size_t BLOCK_SIZE, size_t N)
{
    const std::string filename = "test_filemapping.dat";
    boost::interprocess::file_mapping::remove(filename.data());

    {
        // Pre-size the file: seek to the last byte and write it, creating a
        // (possibly sparse) file of TOTAL_SIZE bytes before mapping it.
        std::ofstream file(filename, std::ios::binary | std::ios::trunc);
        file.seekp(static_cast<std::streamoff>(TOTAL_SIZE-1));
        file.write("", 1);
    }

    std::chrono::system_clock::time_point start;
    std::chrono::system_clock::time_point end;
    {
        boost::interprocess::file_mapping fmap(filename.data(), boost::interprocess::read_write);
        boost::interprocess::mapped_region mreg(fmap, boost::interprocess::read_write);
        mreg.advise( boost::interprocess::mapped_region::advice_sequential );

        // Wrap the mapped region in an array_sink-backed streambuf so the
        // block writes can go through a std::ostream interface.
        std::shared_ptr<std::streambuf> buf(
            new boost::iostreams::stream_buffer<boost::iostreams::array_sink>(
                static_cast<char*>(mreg.get_address()), mreg.get_size()));
        std::ostream ostream( buf.get() );

        const std::vector<char> data(BLOCK_SIZE,1);

        start=std::chrono::system_clock::now();     
        for ( size_t i=0; i<N; i++ ) {
            ostream.write( data.data(), data.size() );
        }
        end=std::chrono::system_clock::now();       
    }

    auto total = end-start;
    std::cout << "test_mapped_file_ostream (ms): " << std::chrono::duration_cast<std::chrono::milliseconds>(total).count() << std::endl;

    return 0;
}

int test_mapped_file_stdcopy_n(size_t TOTAL_SIZE, size_t BLOCK_SIZE, size_t N)
{
    const std::string filename = "test_filemapping_stdcopy.dat";
    boost::interprocess::file_mapping::remove(filename.data());

    {
        // Pre-size the file before mapping, as above.
        std::ofstream file(filename, std::ios::binary | std::ios::trunc);
        file.seekp(static_cast<std::streamoff>(TOTAL_SIZE-1));
        file.write("", 1);
    }

    std::chrono::system_clock::time_point start;
    std::chrono::system_clock::time_point end;
    {
        boost::interprocess::file_mapping fmap(filename.data(), boost::interprocess::read_write);
        boost::interprocess::mapped_region mreg(fmap, boost::interprocess::read_write);
        mreg.advise( boost::interprocess::mapped_region::advice_sequential );

        char* regptr = static_cast<char*>(mreg.get_address()); // write straight into the mapping
        const std::vector<char> data(BLOCK_SIZE,1);

        start=std::chrono::system_clock::now();     
        for ( size_t i=0; i<N; i++ ) {
            std::copy_n( data.data(), data.size(), regptr );
            regptr += data.size();
        }
        end=std::chrono::system_clock::now();       
    }

    auto total = end-start;
    std::cout << "test_mapped_file_stdcopy_n (ms): " << std::chrono::duration_cast<std::chrono::milliseconds>(total).count() << std::endl;

    return 0;
}

int test_fstream_file(size_t TOTAL_SIZE, size_t BLOCK_SIZE, size_t N)
{
    const std::string filename = "test_fstream.dat";

    std::chrono::system_clock::time_point start;
    std::chrono::system_clock::time_point end;
    {
        const std::vector<char> data(BLOCK_SIZE,1);
        std::ofstream file(filename, std::ios::binary | std::ios::trunc);
        start=std::chrono::system_clock::now();     
        for ( size_t i=0; i<N; i++ ) {
            file.write( data.data(), data.size() );
        }
        end=std::chrono::system_clock::now();       
    }
    auto total = end-start;
    std::cout << "test_fstream_file (ms): " << std::chrono::duration_cast<std::chrono::milliseconds>(total).count() << std::endl;

    return 0;
}

int main(int argc, char **argv)
{
    if ( argc != 2 ) {
        std::cout << "Usage: " << argv[0] << " <size of output file in gigabytes>" << std::endl;
        exit(1);
    }

    uint64_t totalsize = std::stoull(argv[1]);
    if (totalsize==0) {
        totalsize = 1;
    }

    const std::size_t GB = (uint64_t)1 << 30;
    const std::size_t TOTAL_SIZE = totalsize << 30;   // gigabytes -> bytes
    const std::size_t BLOCK_SIZE = (uint64_t)1 << 20; // 1 MiB per write
    const std::size_t N = TOTAL_SIZE/BLOCK_SIZE;

    std::cout << "TOTAL_SIZE (GB)=" << TOTAL_SIZE/GB << std::endl;
    test_mapped_file_ostream(TOTAL_SIZE,BLOCK_SIZE,N);
    test_mapped_file_stdcopy_n(TOTAL_SIZE,BLOCK_SIZE,N);
    test_fstream_file(TOTAL_SIZE,BLOCK_SIZE,N);
    return 0;
}

Results: Windows 7, HDD, 64 GB RAM

Timings in ms; the ratios in parentheses are relative to fstream.write (negative means slower than fstream.write):

TOTAL_SIZE (GB)=5
test_mapped_file_ostream (ms): 24610 (-1.88x)
test_mapped_file_stdcopy_n (ms): 3307 (3.9x)
test_fstream_file (ms): 13052

TOTAL_SIZE (GB)=10
test_mapped_file_ostream (ms): 49524 (-1.3x)
test_mapped_file_stdcopy_n (ms): 6610 (5.8x)
test_fstream_file (ms): 38219

TOTAL_SIZE (GB)=15
test_mapped_file_ostream (ms): 85041 (1.52x)
test_mapped_file_stdcopy_n (ms): 12387 (10.5x)
test_fstream_file (ms): 129964

TOTAL_SIZE (GB)=20
test_mapped_file_ostream (ms): 122897 (1.7x)
test_mapped_file_stdcopy_n (ms): 17542 (12.2x)
test_fstream_file (ms): 213697

Profiling

[profiler screenshot: most of the time in the ostream test is spent in boost::iostreams::detail::direct_streambuf copying bytes]

  • I got test_filemapping.dat: 48440 test_fstream.dat: 63071. You'd have to profile deeper. What system calls are being issued? How does your OS process them? There is not one best way to do I/O. – Cubbi Oct 24 '15 at 14:53
  • 1
  • One thing that seems wrong is that you don't close either accessor object before taking the time, which basically means that any time it takes to flush buffers is not included in your measurements. Nitpick: M is for 1,000,000, you mean Mi for 0x100000, ditto for G vs Gi as prefix; you also don't use ms to mean 1/1024th of a second. – Ulrich Eckhardt Oct 24 '15 at 15:35

2 Answers


You're using a text-oriented ostream anyway; formatting to the stream is going to account for a large portion of the time taken.

Other than that, consider madvising for sequential access.

Finally, profile to find your bottlenecks.
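
Since your app is built on ostream and a refactor is off the table, one possible middle ground (an untested sketch, not a Boost facility) is a trivial streambuf whose put area is the mapped region itself. ostream::write then goes through basic_streambuf::xsputn, which common library implementations (e.g. libstdc++) service with a single bulk copy into the put area, so there is no intermediate buffer to drain:

#include <cstddef>
#include <ostream>
#include <streambuf>

// Sketch: a streambuf whose put area *is* the mapped region, so
// ostream::write() lands directly in the mapping with no intermediate buffer.
// No growing/remapping: overflow() simply reports EOF when the region is full.
struct mapped_streambuf : std::streambuf {
    mapped_streambuf(char* base, std::size_t size) {
        setp(base, base + size); // put area = the whole mapping
    }
protected:
    int_type overflow(int_type) override {
        return traits_type::eof(); // region exhausted
    }
};

Hypothetical usage with the question's mapped_region:

mapped_streambuf sb(static_cast<char*>(mreg.get_address()), mreg.get_size());
std::ostream os(&sb);
os.write(data.data(), data.size());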


I've hit this problem with all the tricks I know and came up with the following really bare-bones POSIX mmap vs. write comparison.

I used madvise and fadvise with SEQUENTIAL|WILL_NEED where applicable, and made sure that sparseness wasn't a cause for slowness.

The short summary of it all is:

  • your code could really be a lot simpler (see rev. 176f546ea8f65050c)
  • the maps are fast for smaller volumes
  • the buffering is probably what makes the stream-based implementations shine over the mmap-based ones (see the pubsetbuf sketch after this list)
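
To poke at that buffering hypothesis, a simple experiment (a sketch only; the 16 MiB buffer size is an arbitrary choice, and pubsetbuf must be called before open() to take effect on common implementations such as libstdc++) is to hand the ofstream a much larger buffer and re-run the write loop:

#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    // Give the ofstream a large user-supplied buffer before opening the file.
    std::vector<char> storage(std::size_t(1) << 24); // 16 MiB (arbitrary)
    std::ofstream file;
    file.rdbuf()->pubsetbuf(storage.data(),
                            static_cast<std::streamsize>(storage.size()));
    file.open("test_bigbuf.dat", std::ios::binary | std::ios::trunc);

    std::vector<char> data(std::size_t(1) << 20, 1); // 1 MiB blocks, as in the question
    for (int i = 0; i < 1024; ++i)                   // 1 GiB total
        file.write(data.data(), static_cast<std::streamsize>(data.size()));
}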

Live On Coliru

#include <boost/chrono.hpp>
#include <boost/chrono/chrono_io.hpp>
#include <algorithm> // std::copy_n
#include <cassert>   // assert
#include <iostream>
#include <vector>

// mmap the manual way
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef COLIRU
const std::size_t TOTAL_SIZE = 5ul << 30;
const std::size_t BLOCK_SIZE = 1ul << 20;
#else
const std::size_t TOTAL_SIZE = 1ul << 20;
const std::size_t BLOCK_SIZE = 1ul <<  9;
#endif
static_assert(0 == TOTAL_SIZE%BLOCK_SIZE, "not divisible by block size");
const int N = TOTAL_SIZE/BLOCK_SIZE;

template <typename Caption, typename F>
auto timed(Caption const& task, F&& f) {
    using namespace boost::chrono;
    struct _ {
        high_resolution_clock::time_point s;
        Caption const& task;
        ~_() { std::cout << " -- (" << task << " completed in " << duration_cast<milliseconds>(high_resolution_clock::now() - s) << ")\n"; }
    } timing { high_resolution_clock::now(), task };

    return f();
}

void test_mapped_file() {
    std::vector<char> const data(BLOCK_SIZE, 1);
    const std::string filename = "test_filemapping.dat";

    std::remove(filename.c_str());

    int fd = open(filename.c_str(), O_RDWR|O_CREAT, 0644);

    if (fd==-1) {
        perror("open");
        exit(255);
    }

    if(posix_fallocate64(fd, 0, TOTAL_SIZE)) {
        perror("fallocate64");
        exit(255);
    }

    posix_fadvise64(fd, 0, TOTAL_SIZE, POSIX_FADV_WILLNEED | POSIX_FADV_SEQUENTIAL);

    char* fmap = static_cast<char*>(mmap64(nullptr, TOTAL_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0));

    if (!fmap || fmap == MAP_FAILED) {
        perror("mmap");
        exit(255);
    }

    madvise(fmap, TOTAL_SIZE, MADV_SEQUENTIAL | MADV_WILLNEED);

    timed(filename, [output=fmap, &data] () mutable {
        for (size_t i = 0; i < N; i++) {
            std::copy_n(data.data(), data.size(), output);
            output += data.size();
        }
    });

    munmap(fmap, TOTAL_SIZE);
    close(fd);
}

void test_posix_write() {
    std::vector<char> const data(BLOCK_SIZE, 1);
    const std::string filename = "test_posix.dat";

    std::remove(filename.c_str());

    int fd = open(filename.c_str(), O_RDWR|O_CREAT, 0644);

    if (fd==-1) {
        perror("open");
        exit(255);
    }

    posix_fadvise64(fd, 0, TOTAL_SIZE, POSIX_FADV_WILLNEED | POSIX_FADV_SEQUENTIAL);

    timed(filename, [&] () mutable {
        for (size_t i = 0; i < N; i++) {
            ptrdiff_t count = ::write(fd, data.data(), data.size());
            if (-1 == count) { 
                perror("write");
                exit(255);
            }
            assert(count == BLOCK_SIZE);
        }
    });

    close(fd);
}

int main() {
    test_mapped_file();
    test_posix_write();
}

When tested on Coliru, this prints:

./a.out; md5sum *.dat
 -- (test_filemapping.dat completed in 0 milliseconds)
 -- (test_posix.dat completed in 8 milliseconds)
d35bb2e58b602d94ccd9628f249ae7e5  test_filemapping.dat
d35bb2e58b602d94ccd9628f249ae7e5  test_posix.dat

Run locally (5 GiB volumes):

$ ./test
 -- (test_filemapping.dat completed in 1950 milliseconds)
 -- (test_posix.dat completed in 1307 milliseconds)
  • Did some detailed review, refactorings and benchmarks. See recorded live-stream [part 1](https://www.livecoding.tv/video/mmap-vs-stream-io-benchmark-part-1/) and [part 2](https://www.livecoding.tv/video/mmap-vs-stream-io-benchmark-part-2/) – sehe Oct 24 '15 at 21:50
  • Found that boost::iostreams::detail::direct_streambuf was spending lots of time copying bytes. With std::copy_n the performance is nearly 4x better than ofstream::write. – mab13 Oct 26 '15 at 19:42

You might be experiencing thrashing, which will significantly slow down writing the file via memory mapping. Your benchmark writes out at least 5 gigabytes of data. If you do not have that much RAM available, the OS will be busy exchanging dirtied pages in memory for data on disk.

As sehe suggested, you might consider madvising, because in this scenario you are accessing the memory-mapped file sequentially:

mreg.advise( boost::interprocess::mapped_region::advice_sequential );

However, note that this will not fix thrashing problems.
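
One mitigation worth trying (a sketch built on Boost.Interprocess's mapped_region::flush; the 256 MiB chunk size is an arbitrary tuning knob) is to queue writeback of completed ranges as you go, so dirty pages are written out incrementally instead of accumulating. This would replace the write loop in test_mapped_file_stdcopy_n:

const std::size_t FLUSH_CHUNK = std::size_t(256) << 20; // 256 MiB (hypothetical)
std::size_t written = 0, flushed = 0;
for (size_t i = 0; i < N; i++) {
    std::copy_n(data.data(), data.size(), regptr);
    regptr  += data.size();
    written += data.size();
    if (written - flushed >= FLUSH_CHUNK) {
        // Asynchronously schedule writeback of the finished range so the
        // kernel doesn't accumulate gigabytes of dirty pages.
        mreg.flush(flushed, written - flushed, /*async=*/true);
        flushed = written;
    }
}

Whether this actually helps depends on the OS's writeback policy (on Windows, the flush should map to FlushViewOfFile).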

  • My updated answer shows that `madvise`/`fadvise` have negligible impact here. Nice to know that I could have done the `advise` with the Boost interface though :) – sehe Oct 24 '15 at 21:51