
I'm asking this question because I have been working on a project that requires collecting a lot of data REALLY fast, depending on the scenario: 5.7 GBytes per second (with a capital BYTE, i.e. bytes, not bits) or 11.4 GBytes per second.

We are working with a small striped RAID array using three Samsung Pro NVMe drives (for 11.4 GB/s we have a larger array).

The project has so far been developed on Windows. I wanted to make things as portable as possible, so I focused on using the C++ Standard Library; however, no matter what I did, I could not crack transferring files faster than 1.5 GB/s.

The strategy was simple: create a couple of huge swap buffers, and write them directly to disk as a huge unformatted binary file.

Using std::ofstream and benchmarking with manually set buffer sizes of varied lengths via:

rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
open(Filename, std::ios::binary | std::ios::trunc);

followed by my managed write loop, I was able to find a sweet spot, but I was never able to crack 1.5 GB/s.
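In outline, the approach looks like this (a simplified sketch with illustrative names and sizes, not the exact production code):

#include <fstream>
#include <vector>

constexpr std::size_t BUFFER_SIZE = 64 * 1024 * 1024; // stream buffer size, found by benchmarking

void write_blob(const char* filename, const std::vector<char>& swapBuffer)
{
    std::vector<char> streamBuf(BUFFER_SIZE);
    std::ofstream out;
    // pubsetbuf before open(), so it takes effect before any I/O
    out.rdbuf()->pubsetbuf(streamBuf.data(), static_cast<std::streamsize>(streamBuf.size()));
    out.open(filename, std::ios::binary | std::ios::trunc);
    // dump one swap buffer as raw, unformatted bytes
    out.write(swapBuffer.data(), static_cast<std::streamsize>(swapBuffer.size()));
}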

I then found the Windows SDK and its CreateFile function, in particular opening the file with the FILE_FLAG_NO_BUFFERING flag.

This was a game-changer: as long as I made sure I fed it sector-aligned data (in my case everything needed to be some multiple of 512 bytes), I was suddenly able to take full advantage of the RAID array's throughput.
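The core of it, stripped down, looks something like this (illustrative names, no error handling; in real code you would query the volume's actual sector size):

#include <windows.h>
#include <malloc.h>

constexpr DWORD  SECTOR_SIZE  = 512;              // everything must be a multiple of this
constexpr size_t BUFFER_BYTES = 64 * 1024 * 1024; // multiple of SECTOR_SIZE

// Unbuffered handle: the OS page cache is bypassed entirely.
HANDLE file = CreateFileA("capture.bin", GENERIC_WRITE, 0, nullptr,
                          CREATE_ALWAYS, FILE_FLAG_NO_BUFFERING, nullptr);

// Buffer address, write sizes and file offsets all have to be
// sector-aligned, hence the aligned allocation.
char* buffer = static_cast<char*>(_aligned_malloc(BUFFER_BYTES, SECTOR_SIZE));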

I revisited std::ofstream in an attempt to work with more OS-agnostic functions; however, even though one can specify a zero-length buffer for std::ofstream, there doesn't appear to be any documentation of the caveats of using the stream with no buffer.
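For reference, the zero-buffer request itself is just the following (the standard only pins down the unbuffered case when it happens before any I/O; anything else is implementation-defined):

std::ofstream out;
out.rdbuf()->pubsetbuf(nullptr, 0); // request an unbuffered filebuf
out.open(Filename, std::ios::binary | std::ios::trunc);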

std::ofstream allows 64-bit values for its write size, unlike the Windows SDK's WriteFile, which only accepts DWORDs. That caps a single write at the largest multiple of 512 one can squeeze into a uint32_t, and you must manage your writes in a loop if your file exceeds 4 GB (mine do).
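So the WriteFile path ends up as a chunked loop along these lines (a sketch, error handling trimmed; with FILE_FLAG_NO_BUFFERING every chunk, including the last one, must still be a multiple of the sector size):

#include <windows.h>
#include <cstdint>

// Largest multiple of 512 that fits in a DWORD: 4 GiB - 512.
constexpr DWORD MAX_CHUNK = 0xFFFFFE00;

bool write_all(HANDLE file, const char* data, std::uint64_t size)
{
    while (size > 0) {
        const DWORD chunk = static_cast<DWORD>(size < MAX_CHUNK ? size : MAX_CHUNK);
        DWORD written = 0;
        if (!WriteFile(file, data, chunk, &written, nullptr) || written != chunk)
            return false;
        data += written;
        size -= written;
    }
    return true;
}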

This just raises the question: is Microsoft simply not giving the C++ Standard Library devs access to the necessary OS-level system calls to take advantage of ultra-high-speed drive arrays? Or am I missing something in how to use the C++ Standard Library to its full potential?

Smeghead
  • Microsoft are implementing the C++ Standard Library for their compiler. It has the same access to the Win32 API as any other library. The issue isn't lack of access. The issue is that the C++ Standard Library is an abstraction that comes at a cost. That cost manifests itself both with respect to features and to performance. If performance is what you need, you're going to have to scrap cross-platform support and opt out of what is ultimately the feature set exposed by POSIX. In case you are interested, Microsoft host their "STL" on GitHub. – IInspectable Jul 23 '21 at 08:04
  • That said, I find C++'s I/O streams to be one of its worst features. They are clunky to use and conceptually follow the *"everything is a file"* mantra that's broken for so many reasons. There's a reason why `std::format` follows a very different approach, for example. – IInspectable Jul 23 '21 at 08:10
  • GitHub link: https://github.com/microsoft/STL/ . You might want to create an issue or even propose a patch. Though the performance of "one of the worst C++ features" may have too low a priority, definitely lower than, say, C++23 features or fixing C++20 defects. – Alex Guteniev Jul 23 '21 at 08:18
  • The "C++ Standard Library" in the end must use Win32 APIs to access files, including `CreateFile`; just check the MS STL implementation and see, it's open source. But block operations should generally be done as blocks, not as streams. – phuclv Jul 23 '21 at 08:52
  • @AlexGuteniev the MS STL implements its `fstream` in terms of the C-style standard I/O. So it would get kicked to windiv, which owns the Universal CRT, IIRC. To the OP: [a bug exists](https://github.com/microsoft/STL/issues/1107) – Mgetz Jul 23 '21 at 13:10

2 Answers

3

"is Microsoft simply not giving the C++ Standard Library Devs..."

You might notice that the product you're using is called Microsoft Visual Studio. The Standard Library developers for Visual Studio work at Microsoft, although on a different team than the Windows developers.

The reason is a bit simpler: the Visual C++ devs can't possibly know and optimize for every possible usage scenario. It's a bit unusual to do text formatting at such high speeds. Remember, the point of ostream is to provide operator<<. ofstream is for formatted output to files. But for high-speed I/O you want binary output anyway.
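To illustrate: formatted insertion runs every value through operator<< (text conversion, locale machinery), while unformatted output hands the raw bytes straight to the stream buffer. A sketch of the difference:

#include <cstddef>
#include <fstream>

void dump(std::ofstream& out, const double* samples, std::size_t count)
{
    // Formatted: each value is converted to text - fine for logs, slow in bulk.
    //   for (std::size_t i = 0; i < count; ++i) out << samples[i] << '\n';

    // Unformatted: the raw bytes pass through unchanged.
    out.write(reinterpret_cast<const char*>(samples),
              static_cast<std::streamsize>(count * sizeof(double)));
}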

MSalters
  • More importantly, the entire MS `fstream` implementation is implemented in terms of C standard file I/O (fopen etc.), so any issues would most likely lie in that code anyway. But without profiling it's hard to tell. – Mgetz Jul 23 '21 at 13:07
  • `fstream` isn't slow because of *stdio.h*. `fstream` is slow because of `fstream`. Once the Universal CRT is published on GitHub you can profile up some proof. – IInspectable Jul 23 '21 at 16:16
  • Thanks, yes, it was ofstream, and I am dumping it directly to disk as a binary stream from RAM. – Smeghead Jul 23 '21 at 22:55
2

To put it bluntly: the bandwidth you're aiming for is within the ballpark of the physical limits of current commodity hardware (~24 GByte/s for 16× PCIe 4.0), and in my own work I found it very challenging to reach single-core memory transfer rates above 8 GByte/s without the use of "dark magic" (aka hand-crafted assembly and optimized system call code). It involved carefully aligning the memory accesses and making use of vector extensions. But most importantly, reaching these levels of optimization requires being aware of the kind of data being processed, what kind of access patterns to expect, and/or building caching intermediaries to accommodate the underlying hardware.
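To give a flavour of that "dark magic": a sketch of a cache-bypassing copy using AVX non-temporal stores (this assumes 32-byte-aligned pointers and a byte count that is a multiple of 32):

#include <immintrin.h>
#include <cstddef>

void stream_copy(void* dst, const void* src, std::size_t bytes)
{
    auto*       d = static_cast<__m256i*>(dst);
    const auto* s = static_cast<const __m256i*>(src);
    // Non-temporal stores bypass the cache hierarchy - useful when the
    // data is written once and never read back by this core.
    for (std::size_t i = 0; i < bytes / sizeof(__m256i); ++i)
        _mm256_stream_si256(d + i, _mm256_load_si256(s + i));
    _mm_sfence(); // order the streaming stores before subsequent writes
}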

Such optimizations are plainly and simply outside the scope of general-purpose standard libraries. Standard libraries must, in their implementation, adhere to the behaviours written down in the specification, and some of those requirements tend to collide with what has to be done to make the most of the underlying hardware.

So I'm sorry to tell you, but you'll probably have to bite the bullet and use the low-level system APIs directly, bypassing the standard library.

datenwolf
  • Hand-crafted assembly isn't needed with Visual Studio; the vector instructions are available as semi-standard intrinsics. But I agree that it's no longer Standard C++ at that point. – MSalters Jul 23 '21 at 08:32
  • @MSalters back when I was confronted with the problem (I had to perform some unpacking and data conversion of a data stream arriving from a high-speed ADC) I tried "everything": first I started with a variation of Duff's device, then added vector intrinsics, first just using vectorized data types, then making use of the built-in vector functions as well. Eventually I arrived at hand-crafting the innermost hot loop, with lots of profiling and taking inspiration from the code a compiler would produce. – datenwolf Jul 23 '21 at 08:37
  • @MSalters the problem is that the compiler doesn't know how that data is going to be used in a "cousin" function down the line; usage of the `restrict` keyword to indicate where no aliasing is happening helps. But ultimately, knowing that the data arrives in chunks of 32 kiB each and is condensed down in a cousin function running in a separate thread is something the compiler can't see. The programmer can work the memory fencing knowing what's happening "on the other side". – datenwolf Jul 23 '21 at 08:41
  • I'm sure that the Windows ReadFile/WriteFile API is approximately equivalent to the Unix/Linux read and write, which would have been my go-to. I did find it fun making sure my buffers were sector-aligned and using the _aligned_malloc functions. But I would have hoped that we're moving towards a more universal C++ which has access to some of the lower-level system calls on all operating systems. – Smeghead Jul 23 '21 at 22:58
  • @Smeghead the problem is that the C++ standard library has to work not only on Win32 and POSIX, but also on *every* other target out there. And some of the semantics of `iostream` require additional work to be done on top of mere `ReadFile`/`WriteFile` / `read`/`write`. It's this additional work that slows them down. Keep in mind that you're only becoming aware of it because you're so close to the theoretical limits of the hardware, so you're feeling all those little extra costs adding up; 99% of other users won't notice it. – datenwolf Jul 24 '21 at 04:06
  • @MSalters: I think you misread me. Please read again – carefully – what I wrote. To clear it up, in simple English: every conforming implementation of the C++ standard library must implement *everything* written down in the specification (the specification may leave things undefined). The implementation must happen on top of whatever the underlying OS APIs are. Some of the specified functions may not directly map to OS-level APIs and as such need to be implemented as part of the standard library. Also, the specification may define certain details in ways very different from how the OS APIs do. – datenwolf Jul 26 '21 at 09:01