
I have a dynamically allocated array of a struct with 17 million elements. To save it to disk, I write

fwrite(StructList, sizeof(Struct), NumStructs, FilePointer)

In a later step I read it with an equivalent fread statement, that is, using sizeof(Struct) and a count of NumStructs. I expect the resulting file will be around 3.5 GB (this is all x64).

Is it possible instead to pass sizeof(Struct) * NumStructs as the size and 1 as the count to speed this up? I am scratching my head as to why the write operation could possibly take minutes on a fast computer with 32 GB RAM (plenty of write cache). I've run home-brew benchmarks and the cache is aggressive enough that 400 MB/sec for the first 800 MB to 1 GB is typical. PerfMon shows it is consuming 100% of one core during the fwrite.

I saw the question here, so what I'm asking is whether there is some loop inside fwrite that can be "tricked" into going faster by telling it to write 1 element of size n*s as opposed to n elements of size s.
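For concreteness, the two call shapes under discussion look like this. This is only an illustrative sketch: the struct definition, file name, and the malloc/free scaffolding are placeholders, not code from the question.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in; the question's real struct definition is not shown. */
typedef struct { double values[26]; } Struct;

int main(void)
{
    size_t NumStructs = 17000000;                      /* element count from the question */
    Struct *StructList = (Struct *)malloc(NumStructs * sizeof(Struct));
    FILE *FilePointer = fopen("structs.bin", "wb");
    if (!StructList || !FilePointer) return 1;

    /* Element-wise form used above: NumStructs elements of size sizeof(Struct) */
    fwrite(StructList, sizeof(Struct), NumStructs, FilePointer);

    /* Single-block form being asked about: 1 element of size sizeof(Struct) * NumStructs */
    /* fwrite(StructList, sizeof(Struct) * NumStructs, 1, FilePointer); */

    fclose(FilePointer);
    free(StructList);
    return 0;
}

On a successful run the only observable difference is the return value: the first form reports how many whole elements were written, the second only 0 or 1.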

EDIT

I ran this twice in release mode and both times I gave up waiting. Then I ran it in debug mode knowing that typically the fwrite operations take way longer. The exact size of the data to be written is 4,368,892,928 bytes. In all three cases, PerfMon shows two bursts of disk write activity about 30 seconds apart, after which the CPU goes to 100% of one core. The file is at that point 73,924,608 bytes. I have breakpoints on either side of the fwrite so I know that's where it's sitting. It certainly seems that something is stuck but I will leave it running overnight and see.

EDIT

Left this overnight and it definitely hung in fwrite; the file never went past 70 MB.

darda
  • Have you tried it? To keep the caching from getting in the way, you can occupy the rest of the RAM with some other application. – Mysticial Feb 22 '14 at 00:18
  • Changing the arguments like that won't affect the performance; it would only affect the error reporting if there's a short write. There isn't going to be much you can do to improve the performance, IMO. You've minimized the system call overhead by only requiring one call; from there, it is a question of how quickly the o/s can allocate 3.5 GiB disk space. You could look for system calls to preconfigure the file size and to make the allocation as near sequential as possible, but that's very platform specific. I see you've got Windows in the tags; I can't help much more... – Jonathan Leffler Feb 22 '14 at 00:29
  • If your program is burning 100% core then it is *not* getting bogged down by the disk. Which makes your assumption that it has anything to do with fwrite() a pretty weak one. Mistrust any anti-malware and use a profiler. – Hans Passant Feb 22 '14 at 00:39
  • 1. You do know that `fwrite` uses a buffer - a raw `write` would be better. 2. Change the design - why read/write huge chunks of data to/from the disk? – Ed Heal Feb 22 '14 at 05:26
  • You didn't give your compiler or runtime library. Your file is big enough for the byte count to overflow a 32-bit unsigned size_t. A poor or old CRT could be screwing this up. – Gene Feb 22 '14 at 05:32
  • @Gene: I did say this was 64-bit (OS and build). To be specific: Visual Studio 2012 (with the multithreaded CRT) on Windows 7. – darda Feb 22 '14 at 14:17
  • @EdHeal: A) why would I not want a buffer, and B) how could my program have picked up where it left off a week ago without writing to disk? – darda Feb 22 '14 at 14:18
  • If you want to specifically target Windows, then you should look into memory mapped files; that way, you can write directly to a memory-mapped version of the file. However, fwrite shouldn't be so slow, so something else is going wrong here. – MicroVirus Feb 22 '14 at 14:44
  • @pelesl Sorry, I guess I missed the tag (though I could swear it wasn't there last night). Something else is wrong here. Probably you clobbered the stack with other code and that corruption causes `fwrite` to hang. Could possibly be a weird problem with your VS installation (though unlikely). Write a tiny program that fills a buffer of the same size and `fwrite`s it to rule out all but an error in your program. When it works (I think it will), pare your program down to an SSCCE that has the problem and post that. – Gene Feb 22 '14 at 15:18
  • @Gene: I added the VS2012 tag today, x64 bit was in the text originally. No worries. As you can see I reproduced the problem in a simple program that I posted as an answer. – darda Feb 22 '14 at 15:31
  • @MicroVirus: I will look into memory mapped files, thanks. Usually the output is not this big; this is a bit of a special case. – darda Feb 22 '14 at 15:31
  • @pelesl - `fwrite` uses a buffer - use `write` instead. As it is a special case, consider a different design to achieve the same outcome. – Ed Heal Feb 22 '14 at 16:11
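Following up on MicroVirus's memory-mapped file suggestion (and Jonathan Leffler's point about presizing the file), here is a minimal sketch of what that could look like on Windows x64. The path and the 4 GiB size are placeholders, error handling is reduced to early returns, and this is only an illustration of the approach, not code from the question.

#include <windows.h>
#include <string.h>

int main(void)
{
    unsigned long long size = 4ULL * 1024 * 1024 * 1024;   /* example size */
    HANDLE file, mapping;
    unsigned char *view;

    file = CreateFileA("T:\\test_mapped.bin", GENERIC_READ | GENERIC_WRITE,
                       0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    /* Creating the mapping with a maximum size extends the file to that size. */
    mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE,
                                 (DWORD)(size >> 32), (DWORD)(size & 0xFFFFFFFF), NULL);
    if (!mapping) return 1;

    /* Map the whole file into the address space; fine in a 64-bit process. */
    view = (unsigned char *)MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);
    if (!view) return 1;

    /* "Write" by filling the view; a memset stands in here for copying the real data. */
    memset(view, 0, (size_t)size);

    FlushViewOfFile(view, 0);   /* ask the OS to start writing dirty pages */
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}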

2 Answers


This is definitely a problem with fwrite (I tried both VS2012 and 2010).

Starting with a standard C++ project, I changed only the settings to use the multi-byte character set, the x64 target, and the statically linked multithreaded debug version of the standard library.

The following code succeeds (no error checking for conciseness):

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp;
    long long n;
    unsigned char *data;

    n = 4LL * 1024 * 1024 * 1024 - 1;

    data = (unsigned char *)malloc(n * sizeof(unsigned char));

    fp = fopen("T:\\test.bin", "wb");

    fwrite(data, sizeof(unsigned char), n, fp);

    fclose(fp);
}

In the debug version on my machine, the program finishes in about 1 minute (the malloc takes only a few seconds, so this is mostly fwrite), consuming on average 30% CPU. PerfMon shows the write occurs entirely at the end, in a single "flash" of 4 GB (write cache).

Change the - 1 to a + 1 in the assignment of n and you reproduce the problem: instantaneous 100% CPU usage and nothing is ever written. After several minutes, the size of the file was still 0 bytes (recall in my actual code it manages to dump 70 MB or so).
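One way to read that one-byte boundary (consistent with the 32-bit wrapping hypothesis raised in the comments, though nothing here inspects the CRT's actual internals) is that the total byte count sits on either side of what fits in 32 bits:

#include <stdio.h>

int main(void)
{
    /* Hypothetical illustration only; this is not the CRT's code. */
    unsigned long long below = 4ULL * 1024 * 1024 * 1024 - 1;   /* 4294967295, the largest 32-bit value */
    unsigned long long above = 4ULL * 1024 * 1024 * 1024 + 1;   /* 4294967297 */

    /* If an internal byte count were truncated to 32 bits: */
    printf("%u %u\n", (unsigned)below, (unsigned)above);        /* prints 4294967295 1 */
    return 0;
}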

This is definitely a problem in fwrite, as the following code can write the file just fine:

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>

#ifndef min
#define min(a, b) ((a) < (b) ? (a) : (b))   /* min is not standard C; MSVC projects usually get it from <windows.h> */
#endif

int main()
{
    FILE *fp;
    long long n;
    long long counter = 0;
    long long chunk;
    unsigned char *data;

    n = 4LL * 1024 * 1024 * 1024 + 1;

    data = (unsigned char *)malloc(n * sizeof(unsigned char));

    fp = fopen("T:\\test.bin", "wb");

    while (counter < n)
    {
        chunk = min(n - counter, 100*1000);
        fwrite(data+counter, sizeof(unsigned char), chunk, fp);
        counter += chunk;
    }

    fclose(fp);
}

On my machine, this took 45 seconds instead of 1 minute. CPU usage is not constant; it comes in bursts, and the reported I/O writes are more distributed than with the "single chunk" method.

I would be really surprised if the increase in speed were false (that is, due to caching), because I've done tests before in which I wrote several files containing all the same data vs. files containing randomized data, and the reported write speeds (with caching) were the same. So I'm willing to bet that at least this implementation of fwrite does not like huge chunks passed to it at a time.

I also tested fread, reading the file back immediately after closing it for writing in the 4 GB + 1 case, and it returns in a timely manner - a few seconds at most (there was no real data here, so I didn't verify the contents).

EDIT

I ran some tests comparing the chunk-writing method and the single fwrite call on a 4 GB - 1 file (the largest size both methods can handle). Running the program several times (with code such that the file is opened, written with multiple fwrite calls, closed, then opened again, written in a single call, and closed), there is no question the chunk-writing method returns faster. At worst it returns in 68% of the time the single call takes, and at best I got just 20%.
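A minimal sketch of the kind of back-to-back comparison described above (the timing calls, chunk size, and path are illustrative; the original benchmark code is not shown):

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef min
#define min(a, b) ((a) < (b) ? (a) : (b))
#endif

int main(void)
{
    long long n = 4LL * 1024 * 1024 * 1024 - 1;   /* largest size both methods handle */
    unsigned char *data = (unsigned char *)malloc((size_t)n);
    FILE *fp;
    long long counter, chunk;
    clock_t t0, t1;

    if (!data) return 1;

    /* Pass 1: many fwrite calls in 100,000-byte chunks */
    fp = fopen("T:\\test.bin", "wb");
    t0 = clock();
    for (counter = 0; counter < n; counter += chunk)
    {
        chunk = min(n - counter, 100 * 1000);
        fwrite(data + counter, sizeof(unsigned char), (size_t)chunk, fp);
    }
    t1 = clock();
    fclose(fp);
    printf("chunked: %.1f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Pass 2: one huge fwrite call */
    fp = fopen("T:\\test.bin", "wb");
    t0 = clock();
    fwrite(data, sizeof(unsigned char), (size_t)n, fp);
    t1 = clock();
    fclose(fp);
    printf("single:  %.1f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    free(data);
    return 0;
}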

darda
  • Yes, looks like a 32-bit wrapping bug. Re: performance, bear in mind that `malloc` costs can be paid forward if it reserves VM space for big blocks but does not initialize swap until the memory is touched. I do not know if VC malloc does this. But if so, this can cause strange interaction between the write and the pager. It would be fun to try initializing the memory before writing it. – Gene Feb 22 '14 at 15:55
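Gene's suggestion to touch the memory first would look something like this on top of the first test program (a hypothetical variation; its effect was not measured here):

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    long long n = 4LL * 1024 * 1024 * 1024 - 1;
    unsigned char *data = (unsigned char *)malloc((size_t)n);
    FILE *fp = fopen("T:\\test.bin", "wb");
    if (!data || !fp) return 1;

    /* Touch every page first so lazily committed memory is already backed
       by the time fwrite runs (Gene's hypothesis). */
    memset(data, 0xAB, (size_t)n);

    fwrite(data, sizeof(unsigned char), (size_t)n, fp);
    fclose(fp);
    free(data);
    return 0;
}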

This is not a problem with fwrite but intended (though admittedly uncool) behavior:

The fwrite() function shall write, from the array pointed to by ptr, up to nitems elements whose size is specified by size, to the stream pointed to by stream. For each object, size calls shall be made to the fputc() function, taking the values (in order) from an array [...]

So basically, by using fwrite correctly without cheating, you are requesting billions of calls to fputc.
With the above requirement in mind, it's clear how you have to cheat in order to make it work properly, too.
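To make the "cheating" concrete, these are the argument shapes being contrasted (the struct is a placeholder sized to match the 16-byte records mentioned in the comments below; Damon's actual benchmark code is not shown):

#include <stdio.h>
#include <stdlib.h>

typedef struct { char bytes[16]; } Record;   /* placeholder 16-byte record */

int main(void)
{
    size_t count = 1000000;
    Record *buf = (Record *)calloc(count, sizeof(Record));
    FILE *fp = fopen("records.bin", "wb");
    if (!buf || !fp) return 1;

    /* "Telling the truth": count elements, each of size sizeof(Record) */
    fwrite(buf, sizeof(Record), count, fp);

    /* Byte-wise: 16 times as many objects of size 1 */
    /* fwrite(buf, 1, count * sizeof(Record), fp); */

    /* "Cheating": a single huge object of size count * sizeof(Record) */
    /* fwrite(buf, count * sizeof(Record), 1, fp); */

    fclose(fp);
    free(buf);
    return 0;
}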

Damon
  • I don't understand what you're referring to as cheating. And cite your sources. – darda Feb 22 '14 at 18:37
  • If you are writing X structures of size Y and you are telling the C standard library to instead write one object of size X*Y, that is factually "cheating". You're not telling the truth about what you're doing. The standard quote is from http://pubs.opengroup.org/onlinepubs/9699919799/functions/fwrite.html but you can refer to the identical wording in 7.19.8.2 of ISO 9899 (C99) instead, if you prefer that. – Damon Feb 22 '14 at 20:03
  • I don't read it the way you do. To me this says that fwrite has two nested loops inside of which is a call to fputc. Cheating, as you say, makes no difference other than granularity of the return value (i.e. knowing how many items were actually written) – darda Feb 23 '14 at 15:59
  • Well, go ahead and measure if you don't believe it. I ran a test on my system (Win7 64, with MinGW-gcc 4.8.2) doing 500k writes with 1M structs of size 16 each, and it takes 11.9 seconds if you tell the C standard library the truth about what you're doing, 12.1 seconds if you tell it to write 16 times as many objects of size 1, but only 7.2 seconds if you tell it that you're writing a single huge object. Re-ran each test 5 times, timings are consistent, varying by +/- 0.2 seconds. It certainly makes a difference which way the loop inside `fwrite` runs. – Damon Feb 24 '14 at 10:30
  • You should post an answer [here](http://stackoverflow.com/questions/10564562/fwrite-effect-of-size-and-count-on-performance/21977008#21977008) with your findings. I will try it with my test program and let you know. – darda Feb 25 '14 at 15:05