
Setup

Using standard C (no platform-specific code), I wrote a program that does the following:

  1. Get the starting clock()
  2. Open a file
  3. Write a ~250 MB string to it using one of the modes listed below
  4. Close the file
  5. Repeat steps 2–4 10,000 times as fast as possible (RIP, storage unit)
  6. Get the ending clock()
  7. Do some time calculations and output the results
  • A) bulk mode: write everything at once (= one call to fwrite)
  • B) chunk mode: write the string in chunks of slightly more than 1 MB each (= multiple calls to fwrite, about 250)

Then, I let the program run on two different computers.

Expectation

I expected A) to be faster than B).

Results

The results below are from my beefy PC with a Samsung 970 EVO M.2 SSD (CPU: AMD Ryzen 2700X, 8 cores / 16 threads). The output label is slightly wrong: it should read ns/file, not ns/write.

[benchmark output screenshot: chunk mode faster]

The results below are from my laptop. I don't know what type of SSD is installed (and haven't bothered to check). If it matters, or anyone wants to look it up, the laptop is a Surface Book 3.

[benchmark output screenshot: bulk mode faster]

Conclusion

  • Beefy PC: B) is faster than A), contrary to my expectation.
  • Laptop: A) is faster than B), matching my expectation.

My best guess is that some sort of hidden parallelization is at work. Either the CPU does smart things, the SSD does very smart things, or they work together to do incredibly smart things. But anything more specific than that is too much of a guess for me to write down here.

What explains the difference between my expectation and the results?

The benchmark

Check out https://github.com/rphii/Rlib, under examples/writecomp.c

More Text

I noticed this effect while working on my beefy PC with a string of length ~25 MB. Since B) was a marginal but consistent ~4 ms faster than A), I increased the string length and ran a more thorough test.

    Is not the benchmark program short enough to present in the question? Details of the implementation might factor in to the observed differences. – John Bollinger Feb 09 '22 at 22:11
  • 1
    Disk IO might be buffered on multiple levels. It also depends on what else the PC is doing at the moment. Also you might be metering the buffer writes rather than the actual media access. – Eugene Sh. Feb 09 '22 at 22:12
  • @JohnBollinger sure, why not, I have it on github anyways. Edited in the link. – rphii Feb 09 '22 at 22:18
  • 1
    Characteristics of the CPU caches could have something to do with it. – John Bollinger Feb 09 '22 at 22:34
  • 1) `clock` does not measure the wall clock time but the CPU time. Please read [this post](https://stackoverflow.com/questions/17432502). 2) Reads/writes are generally buffered. 3) Operating systems generally uses an in-memory cache (especially for HDD). 4) SSD reads can be faster in parallel (and often are for recent ones) while HDD are almost never faster in parallel. ([this quite recent post](https://stackoverflow.com/a/70911503/12939557) provides some information about caching and buffering). – Jérôme Richard Feb 09 '22 at 22:41
  • Start by running a profile. It appears you do a lot of allocation and copying in your code. – wildplasser Feb 09 '22 at 22:46
  • @JérômeRichard yea, the `clock` is really not what I desire to use, but I figured I'd just use that since I don't know anything better off the top of my head. Thanks for linking me to *that post*. – rphii Feb 09 '22 at 22:55
  • @wildplasser I know. If you're talking about embedded, it isn't intended to be run on embedded. Next: Is that a bad thing? Sure. That's why I try to minimize it, to get a tradeoff between speed and memory allocation. (check the append functions, where I calculate the required amount...). Next: Yea, I could use a custom malloc and realloc etc, and when I have plenty of time left, I'll look into how this works (I already did a little bit). But for the time being, just let me use them, for gods sake. Unless you convince me to not use them? Or something better? We're talking about MB of strings... – rphii Feb 09 '22 at 23:05
  • @wildplasser but you did trigger me to run a profile right now (on at least my laptop, since I have him on me right now) – rphii Feb 09 '22 at 23:08
  • I hope you realize what [v][s]n[]printf() does internally? – wildplasser Feb 09 '22 at 23:17
  • @wildplasser how deep are we talking? Bitlevel, like assembly? No, I don't. Why do you ask, I guess something about my code is off? I would appreciate it a ton if you could bestow your knowledge upon me... – rphii Feb 09 '22 at 23:21
  • @wildplasser you know, instead of hinting at something with an open-ended question, leaving me needing to unnecessarily dig everywhere and you and me following each other up, you could at least guide me to someplace. Like a website as evidence to why or how the thing I'm doing is bad, how to improve it, your own experience or just some keyword. I'm not trying to sound rude or anything, I'd really like to know more. But again, on an open-ended question? If you leave it like that, I'm deeply tempted to lose all respect and gratitude. Not even sorry, if you mind. – rphii Feb 09 '22 at 23:44
  • Consider: `clock_gettime(CLOCK_MONOTONIC,...)` for timing – Craig Estey Feb 10 '22 at 00:11
  • To minimize the effects of unflushed buffers use `fflush` after `fwrite`. To minimize kernel buffering/caching do `sync()` before and after the write loop. (e.g.) `sync, timget, write stuff, fflush, sync, timget` – Craig Estey Feb 10 '22 at 00:22
  • @CraigEstey thanks for the pointers. Even though it seems that `clock_gettime` and `sync` are unix only stuff, I'll look into both of it. – rphii Feb 10 '22 at 00:34

1 Answer


Since no one else is going to, I'll answer my own question based on the comments I got:

  1. clock does not measure the wall clock time but the CPU time. Please read this post.
  2. Reads/writes are generally buffered.
  3. Operating systems generally use an in-memory cache (especially for HDDs).
  4. SSD reads can be faster in parallel (and often are for recent ones) while HDD are almost never faster in parallel. (this quite recent post provides some information about caching and buffering).