I am trying to implement fast processing of large files using Visual Studio 2019. Data should be read, processed, and then written to the end of the same file. After some tests, I found that a file buffer of 1 MB seems to be the best option on my hardware.

Here, I'm trying to set it to 1MB:

#include <fstream>
#include <array>
#include <memory>

using namespace std;

int main()
{
    const streamsize BUFFER_SIZE = 1 * 1024 * 1024;
    unique_ptr<::array<char, BUFFER_SIZE>> buffer = make_unique<::array<char, BUFFER_SIZE>>();

    const streamsize FILE_BUFFER_SIZE = 1 * 1024 * 1024;
    unique_ptr<::array<char, FILE_BUFFER_SIZE>> file_buffer = make_unique<array<char, FILE_BUFFER_SIZE>>();

    ios::sync_with_stdio(false);

    fstream stream;
    stream.rdbuf()->pubsetbuf(file_buffer->data(), file_buffer->size());
    stream.open(R"(C:\test\test_file.bin)", ios::in | ios::out | ios::binary);

    while (stream.good())
    {
        stream.read(buffer->data(), buffer->size());

        // Some data processing and writes here
    }   
}

While monitoring the program with Sysinternals Process Monitor, I can see that WriteFile is indeed called with a 1 MB buffer, but ReadFile is called 256 times per loop iteration with only a 4 KB buffer. This leads to much worse performance.

I've googled this problem and found no similar cases. I would appreciate any help on this.

DavidZi

2 Answers

The behaviour of setbuf isn't very well specified: https://en.cppreference.com/w/cpp/io/basic_filebuf/setbuf

According to cppreference (which matches my experience), libstdc++ only uses the buffer if you call pubsetbuf before opening the file, whereas Visual Studio only uses it if you call pubsetbuf after opening the file. Therefore, for cross-platform code that has a reasonable chance (but no guarantee) of using your buffer, you should call it both before and after open:

fstream stream;
stream.rdbuf()->pubsetbuf(file_buffer->data(), file_buffer->size());
stream.open(R"(C:\test\test_file.bin)", ios::in | ios::out | ios::binary);
stream.rdbuf()->pubsetbuf(file_buffer->data(), file_buffer->size());

Also note that you don't actually need to supply a buffer to pubsetbuf; you can just pass a null pointer together with the size:

fstream stream;
stream.rdbuf()->pubsetbuf(nullptr, BUFFER_SIZE);
stream.open(R"(C:\test\test_file.bin)", ios::in | ios::out | ios::binary);
stream.rdbuf()->pubsetbuf(nullptr, BUFFER_SIZE);

If you want to target libstdc++ in the future it is also worth noting that your buffer size needs to be 1 larger than your desired size.
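Putting the points above together, here is a minimal sketch of a read loop that tries to install a custom file buffer on both libstdc++ and Visual Studio. The function name `read_all_with_file_buffer` and its parameters are hypothetical and only for illustration. The extra byte follows the libstdc++ note above, and as stated earlier there is no guarantee that a given implementation actually uses the buffer.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: opens `path` for reading and writing, asks the stream
// to use a custom buffer, reads the file in chunks, and returns the number of
// bytes read (0 if the file cannot be opened).
std::size_t read_all_with_file_buffer(const std::string& path,
                                      std::size_t desired_buffer_size,
                                      std::size_t chunk_size)
{
    if (chunk_size == 0) return 0;

    // One extra byte, following the libstdc++ note in the answer.
    // Declared before the stream so that it outlives the stream.
    std::vector<char> file_buffer(desired_buffer_size + 1);
    const auto n = static_cast<std::streamsize>(file_buffer.size());

    std::fstream stream;
    // Per the answer, libstdc++ only honours the buffer when set before open().
    stream.rdbuf()->pubsetbuf(file_buffer.data(), n);
    stream.open(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!stream.is_open()) return 0;
    // Per the answer, Visual Studio only honours the buffer when set after open().
    stream.rdbuf()->pubsetbuf(file_buffer.data(), n);

    std::vector<char> chunk(chunk_size);
    std::size_t total = 0;
    // Loop on the result of read() so that the final partial chunk is counted too.
    while (stream.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) ||
           stream.gcount() > 0)
    {
        total += static_cast<std::size_t>(stream.gcount());
        // Data processing would go here.
    }
    return total;
}
```

For the question's setup, a call could look like `read_all_with_file_buffer(R"(C:\test\test_file.bin)", 1024 * 1024, 1024 * 1024)`. Checking the read calls in Process Monitor again is the way to confirm whether the buffer was really picked up.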

boost::iostreams gives you a little more direct control over buffer sizes.

Alan Birtles
  • Thank you so much, it works! And thanks for details on pubsetbuf(), I learned a lot. – DavidZi Dec 03 '19 at 16:57
  • "`pubsetbuf(nullptr, BUFFER_SIZE)`" - does that have the desired effect? [pubsetbuf docs](https://en.cppreference.com/w/cpp/io/basic_streambuf/pubsetbuf) say it "Calls `setbuf`" in most cases, whose [docs](https://en.cppreference.com/w/cpp/io/c/setbuf) say "If `buffer` is null, equivalent to `std::setvbuf(stream, nullptr, _IONBF, 0)`, which **turns off buffering**". In that case, `BUFFER_SIZE` would be ignored, right? – nh2 Jul 13 '22 at 19:31
  • @nh2 No, it calls the setbuf virtual function not the global setbuf function. For file streams this is defined as https://en.cppreference.com/w/cpp/io/basic_filebuf/setbuf – Alan Birtles Jul 13 '22 at 20:35
  • @AlanBirtles Thanks. But I still cannot conclude from there that `nullptr, BUFFER_SIZE` is supposed to work. The page mentions the special case that "`s` is a null pointer and `n` is zero", and "Otherwise [..] replaces the internal buffer [..] with the user-supplied character array whose first element is pointed to by s", which would be the `nullptr`. – nh2 Jul 15 '22 at 12:48
  • Yes, it's poorly specified. As stated on cppreference, the only thing guaranteed by the standard is that `0,0` disables buffering. In practice, the common implementations behave as stated in this answer and on cppreference. – Alan Birtles Jul 15 '22 at 12:53
What you probably want is a memory mapped file, which is cached. You work against the buffered version of the file in memory, and it is eventually synchronized with the actual disk.

Here is a similar question with answers: Is there a memory mapping api on windows platform, just like mmap() on linux?
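As a rough illustration of the idea, the sketch below maps an existing file into memory and modifies its bytes in place. It uses the Win32 calls `CreateFileA`, `CreateFileMappingA` and `MapViewOfFile` on Windows and `mmap()` elsewhere. The function name `increment_bytes_mapped` and the "add 1 to every byte" processing are made up for the example, and mapping the whole file in one view is an assumption that may not hold for very large files in a 32-bit process. Note also that a view has a fixed size, so appending to the file as in the question would need the file to be extended before it is mapped.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

// Placeholder for the real processing: adds 1 to every byte in place.
static void process_in_place(unsigned char* data, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) ++data[i];
}

// Hypothetical helper: maps an existing, non-empty file read/write, processes
// it in place, and returns the number of bytes processed (0 on failure).
std::size_t increment_bytes_mapped(const std::string& path)
{
#ifdef _WIN32
    HANDLE file = CreateFileA(path.c_str(), GENERIC_READ | GENERIC_WRITE, 0,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 0;

    LARGE_INTEGER file_size;
    if (!GetFileSizeEx(file, &file_size) || file_size.QuadPart <= 0) {
        CloseHandle(file);
        return 0;
    }

    // Size arguments of 0 mean "use the current size of the file".
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE, 0, 0, nullptr);
    if (mapping == nullptr) {
        CloseHandle(file);
        return 0;
    }

    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (view == nullptr) {
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }

    const auto size = static_cast<std::size_t>(file_size.QuadPart);
    process_in_place(static_cast<unsigned char*>(view), size);

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return size;
#else
    const int fd = open(path.c_str(), O_RDWR);
    if (fd < 0) return 0;

    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return 0;
    }

    const auto size = static_cast<std::size_t>(st.st_size);
    // MAP_SHARED makes the changes reach the underlying file.
    void* view = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (view == MAP_FAILED) return 0;

    process_in_place(static_cast<unsigned char*>(view), size);
    munmap(view, size);
    return size;
#endif
}
```

With this approach the operating system decides when pages are read from and written back to disk, so no stream buffer size has to be tuned. Whether it is actually faster than a 1 MB stream buffer for this workload would have to be measured.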

Joshua Clayton