
I have an application that writes a huge amount of data to a file using ofstream.

Today by chance, I found several examples like this one:

const size_t bufsize = 256*1024;
char buf[bufsize];
mystream.rdbuf()->pubsetbuf(buf, bufsize);

What is the default value? 4 KB? 16 KB? Is there any method to find it?

What is the optimal value to use here? 256 KB? 1 MB? What if we can spare 1 GB?

Nick
  • I am not certain, but for boost::asio the best buffer size was one matching the DMA block size of the device (disk/network adapter) used, because then boost will fall back on DMA transfer (which lowers CPU load too). I don't know if this also applies to streams. So if you really have a bottleneck you might check out boost::asio too. Tip: make a standalone test app first and try various sizes to measure throughput with different buffer sizes (streams and boost::asio), and possibly use a profiler. – Pepijn Kramer Jun 10 '23 at 11:15
  • The point might be moot. If you `stream.write` large blocks, data will not be put into that buffer anyway, going straight to the OS page cache instead. And if your blocks are too small, especially with formatted output, the per-call overhead will bottleneck you before the disk does (remember each call is a mutex lock/unlock). That being said, buffers of about 1 MiB work fine. Especially on network filesystems you want large buffers. But you definitely want to benchmark this. – Homer512 Jun 10 '23 at 12:38
  • It writes mostly small binary blocks, mostly 100-200 bytes, but does it fast, one after the other. – Nick Jun 10 '23 at 13:43
  • @Nick and how much data do you write overall? – Homer512 Jun 10 '23 at 14:09
  • 4-8 GB, sometimes 30 GB; the minimum is 200-300 MB, but only in rare cases. – Nick Jun 10 '23 at 14:21

1 Answer


First, let's discuss what's actually going on. What we are controlling here is mostly a chain of memcpys:

  1. We fill our own data structure
  2. ofstream copies that structure into its internal buffer
  3. The kernel copies that buffer into the page cache
  4. The disk subsystem reads the page cache via DMA

If the data structure is larger than the buffer, step 2 is skipped. Step 2 also requires locking and unlocking of a mutex unless you keep it locked via C++20's osyncstream.
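
In case it is useful, here is a minimal sketch of the osyncstream variant (assuming C++20 and a standard library that ships <syncstream>; the file name and block count are placeholders, not taken from the question):

#include <fstream>
#include <syncstream>

int main()
{
  std::ofstream out("data.bin", std::ios::binary);

  std::osyncstream synced(out); // accumulates output in its own buffer first
  char block[150] = {};         // stand-in for one 100-200 byte record
  for(int i = 0; i < 1000; ++i)
    synced.write(block, sizeof block);

  synced.emit(); // forwards everything to out in one synchronized batch
}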

In my experience step 2 can have significant overhead. So what you can do is artificially increase the size of step 1 by buffering multiple smaller write requests. Here is a simple benchmark to test this:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <random>
#include <vector>


int main(int argc, char** argv)
{
  if(argc != 5) {
    std::cerr << "Usage: " << (argc ? argv[0] : "binary")
              << " filename filesize filebuffer membuffer\n";
    return 1;
  }
  const char* filename = argv[1];
  /* Total file size to hit or exceed */
  unsigned long long filesize = std::strtoull(argv[2], nullptr, 10);
  /*
   * Size of the ofstream-internal buffer
   * Setting this to 0 uses the platform-default
   */
  unsigned long long filebufsize = std::strtoull(argv[3], nullptr, 10);
  /*
   * Number of bytes to buffer before calling into ofstream
   * Setting this to 0 is equivalent to calling ofstream directly
   */
  unsigned long long membufsize = std::strtoull(argv[4], nullptr, 10);

  auto filebuf = std::make_unique<char[]>(filebufsize);
  std::ofstream out(filename);
  if(filebufsize > 0)
    out.rdbuf()->pubsetbuf(filebuf.get(), filebufsize);
  std::default_random_engine rng;
  // 100-200 bytes at once
  std::uniform_int_distribution<std::size_t> len_distr(100, 200);
  std::vector<char> membuf;
  for(std::size_t written = 0; written < filesize; written += membuf.size()) {
    membuf.clear();
    do {
      /* Simulates buffering multiple data blocks before calling ofstream */
      std::size_t blocksize = len_distr(rng);
      membuf.resize(membuf.size() + blocksize);
    } while(membuf.size() < membufsize);
    out.write(membuf.data(), membuf.size());
  }
}

And here is a bash script to run parameter combinations with the default values and buffer sizes between 4 kiB and 1 GiB.

#!/bin/bash

FILE=/dev/null
# 10 GiB
FILESIZE=$((10*1024**3))

run() {
    local filebuf="$1"
    local membuf="$2"
    echo "$filebuf $membuf"
    time -p ./a.out "$FILE" "$FILESIZE" "$filebuf" "$membuf"
}

# warmup. ignore first run
run 0 0
run 0 0
for((membuf=4096; membuf<=$((1024**3)); membuf*=2)); do
    run 0 $membuf
done
for((filebuf=4096; filebuf<=$((1024**3)); filebuf*=2)); do
    run $filebuf 0
    for((membuf=4096; membuf<filebuf; membuf*=2)); do
        run $filebuf $membuf
    done
done

/dev/null test

My hypothesis is that the first three steps are fastest if the buffer sizes stay below the level-2 cache size so that memory bandwidth is maximized. I've tested this on a Threadripper CPU. The first two test runs show this result:

0 0
real 3,99
user 3,79
sys 0,20
0 4096
real 1,72
user 1,33
sys 0,38

"0 0" means we simply call ofstream::write with 100-200 bytes at a time, no further changes. "0 4096" means we buffer about 4 kiB of data before calling ofstream::write. This already cuts the runtime in half! The overhead of ofstream is significant. I will not show all data. Larger vector sizes show relatively flat performance between 64 kiB and 2 MiB. In this particular run, the best performance was 1.25 seconds with 256 kiB. Larger sizes deteriorate performance as expected.

0 131072
real 1,29
user 1,28
sys 0,01
0 262144
real 1,25
user 1,24
sys 0,00
0 524288
real 1,29
user 1,28
sys 0,00
0 1048576
real 1,25
user 1,24
sys 0,00

[...]

0 536870912
real 1,46
user 1,36
sys 0,09
0 1073741824
real 1,57
user 1,40
sys 0,17

Increasing the ofstream buffer size to similar levels has no positive effect, e.g.

262144 0
real 3,81
user 3,64
sys 0,16
262144 4096
real 1,71
user 1,39
sys 0,32

All other changes basically confirm these trends. The worst performance happens when increasing the ofstream buffer to 1 GiB without using a larger memory buffer:

1073741824 0
real 4,15
user 3,83
sys 0,32

Although not shown, tests on a tmpfs RAM disk have similar performance numbers, just with higher SYS load.

Real file system test

In a second test, I changed the file to an Ext4 filesystem on an NVMe SSD. Here the higher overhead of the ofstream doesn't matter because the ofstream can still outpace the SSD: 33 seconds for 10 GiB means we get about 310 MiB/s. We still save CPU time by pre-buffering, though.

0 0
real 33,46
user 3,77
sys 6,49
0 4096
real 33,27
user 1,47
sys 7,32

Beyond that, there really isn't much to see and I aborted the test before finishing the 200 MiB block size.

Other aspects

Performance figures might look very different if you run on a network or cluster filesystem. Those tend to favor larger block sizes, but that might be more important for reading than for writing, to reduce the number of network round trips. Multithreading also helps keep every component of the data transfer busy at all times.
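
To illustrate the multithreading point, here is a rough double-buffering sketch, not a tuned implementation: one buffer is filled while the previous one is written in the background (the file name, chunk count and chunk size are made up for the example):

#include <fstream>
#include <future>
#include <utility>
#include <vector>

int main()
{
  std::ofstream out("data.bin", std::ios::binary);
  std::vector<char> front, back; // fill one buffer while the other is written
  std::future<void> pending;     // the write currently in flight, if any

  for(int chunk = 0; chunk < 100; ++chunk) {
    front.assign(1 << 20, 'x');  // stand-in for ~1 MiB of buffered records

    if(pending.valid())
      pending.wait();            // the previous write must finish before we reuse back
    std::swap(front, back);
    pending = std::async(std::launch::async, [&out, &back] {
      out.write(back.data(), back.size());
    });
  }
  if(pending.valid())
    pending.wait();
}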

On faster local filesystems like RAIDs of high-performance U.2 SSDs or large RAID6 HDD arrays, I find that normal page-cached IO cannot exhaust the disk bandwidth. In these cases I switch over to direct IO, with about 1 MiB per block, maybe 4 blocks overlapping (asynchronous IO, threads, Windows overlapped IO). However, if your total write size is smaller than main memory, you might still want to accept the slower page-cached write performance in exchange for keeping the data in cache for reading.
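
For reference, a minimal Linux-only sketch of plain direct IO with 1 MiB blocks (no overlapping, only one block in flight; the file name, total size and error handling are simplified placeholders):

#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main()
{
  /*
   * O_DIRECT bypasses the page cache. Buffer address, transfer size and file
   * offset must be suitably aligned; 4096 bytes is a safe choice on most setups.
   * On glibc, O_DIRECT requires _GNU_SOURCE, which g++ defines by default.
   */
  int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
  if(fd < 0)
    return 1;

  const std::size_t blocksize = 1 << 20; // 1 MiB per write
  void* buf = nullptr;
  if(posix_memalign(&buf, 4096, blocksize) != 0)
    return 1;
  std::memset(buf, 'x', blocksize);      // stand-in for real, pre-buffered records

  for(int i = 0; i < 1024; ++i)          // 1 GiB in total
    if(write(fd, buf, blocksize) != static_cast<ssize_t>(blocksize))
      return 1;

  std::free(buf);
  return close(fd) == 0 ? 0 : 1;
}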

Conclusion

For anything you might find on a regular old desktop system, don't bother with the ofstream buffer size. Buffer a few hundred kiB outside the ofstream, or wait until C++20 is widely available on all platforms you want to support and then try osyncstream.
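
If you want a starting point for that external buffering, here is a rough sketch; BufferedWriter is an invented name and the 256 kiB default is simply the value that did well in the benchmark above:

#include <cstddef>
#include <fstream>
#include <vector>

// Collects small records and forwards them to the ofstream in large chunks
class BufferedWriter
{
public:
  explicit BufferedWriter(const char* filename, std::size_t threshold = 256 * 1024)
    : out_(filename, std::ios::binary), threshold_(threshold)
  { buf_.reserve(threshold_ * 2); }

  ~BufferedWriter()
  { flush(); }

  void write(const char* data, std::size_t size)
  {
    buf_.insert(buf_.end(), data, data + size);
    if(buf_.size() >= threshold_)
      flush();
  }

  void flush()
  {
    if(!buf_.empty()) {
      out_.write(buf_.data(), buf_.size());
      buf_.clear();
    }
  }

private:
  std::ofstream out_;
  std::size_t threshold_;
  std::vector<char> buf_;
};

Call write() once per 100-200 byte record and the ofstream only gets touched every couple of hundred kiB.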

Homer512
  • Thanks a lot. I know this is already answered, but if step 2 is expensive, will switching to good old C `FILE*` work faster? It is also buffered. – Nick Jun 10 '23 at 19:39
  • @Nick, nope, `FILE` works the same. It helps with formatted IO because the lock isn't released within a single `printf` and that can do more than an `ostream` in a single `<<` call. Glibc offers [unlocked stdio](https://man7.org/linux/man-pages/man3/fileno_unlocked.3.html) but most of it is non-standard – Homer512 Jun 10 '23 at 19:51
  • I saw an example with nullptr for the buffer, but I think unbuffered will definitely be slower. – Nick Jun 10 '23 at 19:54
  • @Nick you mean `pubsetbuf(nullptr, 0)`? Yeah, that is usually not a good idea. The `ostream` will simply not use its internal buffer when the input size is larger than the buffer. But be careful, [`setvbuf(nullptr, nbytes)`](https://www.man7.org/linux/man-pages/man3/setvbuf.3p.html) simply lets the `FILE` allocate the buffer – Homer512 Jun 10 '23 at 19:59
  • @Nick but don't mix up direct IO with unbuffered IO. Those are very different things – Homer512 Jun 10 '23 at 20:00