
I have a program that is called by a script. This program writes a lot of data to a file on the disk and then terminates. As soon as it is done running, the script kills power to the entire system.

The problem I am having is that the file does not get written in its entirety. If it is a 4GiB file, only around 2GiB will actually be on the disk when I review it later. The only way I have been able to reliably ensure all the data is written is to make the program sleep for a short period before exiting, but that is a bad, unreliable hack that I don't want to use. Here is some sample code of what my latest attempt involved:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    FILE *output = fopen("/logs/data", "w");

    [fwrite several GiB of data to output]

    fflush(output);                     /* drain the stdio buffer into the kernel */

    int fdo = open("/logs", O_RDONLY);  /* descriptor for the containing directory */
    fsync(fdo);

    fclose(output);
    close(fdo);

    return 0;
}

I initially tried building my FILE from a file descriptor (via fdopen()) and calling fsync() on the descriptor used for /logs/data; however, that produced the same issue. According to the spec for fsync(2):

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

which led me to the code above: creating a separate file descriptor just for the directory containing my data file and calling fsync() on that. However, the results were the same. I don't really understand why this is happening, because fsync() is supposed to be blocking:

The call blocks until the device reports that the transfer has completed.

Additionally, as you can see, I added an fflush() on the FILE, thinking maybe fsync() was only syncing data that had previously been flushed, but this made no difference.

I need to somehow verify that the data has in fact been written to the physical media before ending the program, and I'm not sure how to do that. I see that there are files such as /sys/block/[device]/[partition]/stat which can tell me how many I/O requests are still outstanding, and I could wait for that value to hit 0, but this doesn't seem like a great way to solve what should be a simple issue. Moreover, if any other program is operating on the disk, I don't want to wait on it to sync its data as well: I only care about the integrity of this specific file, and the stat file does not discriminate.

EDIT: As per a suggestion, I attempted to fsync() twice, first on the file and then on the directory:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/logs/data", O_WRONLY | O_CREAT, 0660); /* 0660: octal mode */
    FILE *output = fdopen(fd, "w");

    [fwrite several GiB of data to output]

    fsync(fd);                          /* sync the file's data */
    int fdo = open("/logs", O_RDONLY);
    fsync(fdo);                         /* sync the containing directory */

    fclose(output);                     /* also closes fd */
    close(fdo);

    return 0;
}

This produced some interesting output. With a 4GiB (4294967296-byte) file, the actual size of the data on the disk was 4294963200 bytes, which happens to be exactly one page (4096 bytes) short of the total. It seems to be very close to a working solution, but it is still not guaranteeing every single byte of data.
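For reference, here is a sketch of the ordering suggested in the comments below (drain the stdio buffer with fflush() before calling fsync() on the descriptor). As noted in the follow-up comments, it recovers the missing page most of the time, but still not always:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/logs/data", O_WRONLY | O_CREAT, 0660);
    FILE *output = fdopen(fd, "w");

    [fwrite several GiB of data to output]

    fflush(output);   /* stdio buffer -> kernel page cache */
    fsync(fd);        /* page cache -> device, for the file's data and metadata */

    int fdo = open("/logs", O_RDONLY);
    fsync(fdo);       /* make the directory entry durable as well */
    close(fdo);

    fclose(output);   /* also closes fd */
    return 0;
}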

  • Did you try using `fsync(fileno(output))`? Opening a second instance of the file may not complete the transfer. Ref: https://stackoverflow.com/questions/3167298/how-can-i-convert-a-file-pointer-file-fp-to-a-file-descriptor-int-fd – Halt State Jan 07 '21 at 16:57
  • @4386427 I've added them in for clarity, but it has no effect on the result. I had initially omitted them from my example since they didn't seem relevant to the problem, seeing as the buffers were getting flushed and written immediately prior, and the power being cut dumps the memory anyway – J. Doe Jan 07 '21 at 16:58
  • @HaltState I'm not opening a second instance of the file; I am opening the containing directory and performing fsync() on that, as per the fsync spec I quoted in my post. Additionally, as I mentioned in my post, I also tried creating my FILE using an fd to the location (fdopen(fdo, "w")) and fsync'd on fdo, so it was a single instance that time – J. Doe Jan 07 '21 at 17:01
  • 1
    Read [this](https://devblogs.microsoft.com/oldnewthing/20100909-00/?p=12913) from Raymond Chen, and remember that storage devices lie. – Mark Benningfield Jan 07 '21 at 17:08
  • @MarkBenningfield This article highlights the problem, but it's Windows-specific whereas mine is Linux-specific. More than that, I don't want some global setting; just in this specific circumstance, for this file, I want the disk to ensure it is actually on the physical media – J. Doe Jan 07 '21 at 17:17
  • 1
    I think the point is that fsyncing the directory is something you must do *in addition to* fsync'ing the file itself, which your current code does not do. – Nate Eldredge Jan 07 '21 at 17:25
  • @NateEldredge see edit to question, double fsyncing *almost* works but still not quite – J. Doe Jan 07 '21 at 17:47
  • 3
    Perhaps there is some unwritten data in `output`'s stdio buffer that is not written until the `fclose` call. Try calling `fflush(output);` before `fsync(fd);`. – Ian Abbott Jan 07 '21 at 17:53
  • The directory manipulation looks like voodoo to me. You need to do exactly three things: fflush(output), fsync(fd), and fclose(output), in that order, with nothing in between. – n. m. could be an AI Jan 07 '21 at 18:02
  • Enterprise SSDs have power-loss protection that guarantees the write cache gets written. If you don't have that, then you need FUA support to guarantee data is written before power-off. – stark Jan 07 '21 at 18:08
  • 1
    How are you turning power off ("safely, while waiting for buffered writes to be written", or "dangerously/quickly with major recklessness")? – Brendan Jan 07 '21 at 18:14
  • If it works but you can't explain it, chances are, it doesn't really work. Perhaps try `syncfs(fd)` after (or instead of) `fsync(fd)` instead of fiddling with the directory. – n. m. could be an AI Jan 07 '21 at 18:55
  • @IanAbbott Scratch my previous comment; it appears to be fixed *some of the time*. I've run it many times since (not too quickly, unfortunately, since power cycling and writing 4GiB is slow), and it's about 80/20 on whether it writes every single byte or is missing ~200MiB – J. Doe Jan 07 '21 at 18:57
  • @n.'pronouns'm. I had tried using syncfs earlier but it seems it is not available on the system I'm using – J. Doe Jan 07 '21 at 19:05
  • Not sure if it's relevant, but what filesystem are you using? – Nate Eldredge Jan 07 '21 at 19:12
  • @NateEldredge Ext4 – J. Doe Jan 07 '21 at 19:16
  • 2
    sync and flush only guarantee that the data is written from block cache to the device. It does not guarantee that the device has written the data to the media. That's what FUA (force unit access) and power loss prevention on SSD are for. None of the above guarantee that running processes have written all their data to block cache. That's what the Linux shutdown command is for. It tells all running processes to terminate. If you are just turning off power without doing shutdown, then you are deliberately losing data. – stark Jan 07 '21 at 20:14
  • @stark So what does the Linux shutdown command do to force that write, and how can I do it in my program instead? If I'm not mistaken, the umount command will force the same thing to happen to a mounted disk as well – J. Doe Jan 07 '21 at 20:19
  • You can't unmount a disk with files open. The purpose of unmount is to mark the filesystem clean so it doesn't need to be checked when it is mounted. – stark Jan 07 '21 at 21:20
  • @stark Okay, but as you can see, I close all the files I used, so after doing that, why can't I do whatever unmounting does to block until the data has reached non-volatile storage? – J. Doe Jan 07 '21 at 21:25
  • umount does not assume that you are powering off, so doesn't do that. See my answer below. – stark Jan 07 '21 at 21:57

3 Answers


Have you considered passing the O_DIRECT and/or O_SYNC flags to open()? From the open() manual:

O_DIRECT
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.

O_SYNC
Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion...

This article on LWN (quite old now) also provides some guidelines to ensure data integrity.
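To illustrate, here is a minimal sketch of the O_SYNC approach, reusing the asker's /logs/data path (the buffer contents and block size are placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* With O_SYNC, each write() returns only once the data (and the
       metadata needed to retrieve it) has been transferred to the device. */
    int fd = open("/logs/data", O_WRONLY | O_CREAT | O_SYNC, 0660);
    if (fd < 0) { perror("open"); return 1; }

    char buf[1 << 16];
    memset(buf, 'x', sizeof buf);   /* stand-in for real log data */

    /* Batching writes into large blocks softens the per-write cost. */
    if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
        perror("write");

    close(fd);
    return 0;
}

Every write now pays for a round trip to the device, which is the performance penalty described in the comment below.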

Rachid K.
  • I don't have O_DIRECT available on my system, but I did attempt O_SYNC. The problem I ran into there is that it just destroys performance (due to my number of fwrites). Even after batching all the writes into blocks of various sizes, the performance gains seemed to cap out at around ~7 minutes for 4GiB of data, regardless of how much larger I made my blocks. It does seem to work fine, but it takes about twice as long as my current approach with a 10-15s sleep added to 'ensure' things are written, so while it works, it's much worse than the hack – J. Doe Jan 07 '21 at 22:05

To ensure that all data is written to non-volatile storage, the shutdown command issues the sd_shutdown call to each disk. See https://elixir.bootlin.com/linux/v4.10.17/source/drivers/scsi/sd.c#L3338

This issues two SCSI commands: SYNC_CACHE and START_STOP_UNIT, which are translated to the appropriate action on the underlying device. For SATA devices this means putting the drive in STANDBY mode, which spins down the disk.
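If you need the same effect from a program before the script cuts power, one option is to issue the SYNCHRONIZE CACHE command from user space through the SG_IO ioctl. The sketch below is an illustration under stated assumptions, not a drop-in solution: /dev/sda is a hypothetical device node, and the ioctl needs root (or CAP_SYS_RAWIO) to succeed.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void) {
    int fd = open("/dev/sda", O_RDONLY);      /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char cdb[10] = { 0x35 };         /* SYNCHRONIZE CACHE (10) opcode */
    unsigned char sense[32];
    struct sg_io_hdr hdr;
    memset(&hdr, 0, sizeof hdr);
    hdr.interface_id    = 'S';
    hdr.cmd_len         = sizeof cdb;
    hdr.cmdp            = cdb;
    hdr.dxfer_direction = SG_DXFER_NONE;      /* command has no data phase */
    hdr.sbp             = sense;
    hdr.mx_sb_len       = sizeof sense;
    hdr.timeout         = 60000;              /* in milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0) { perror("SG_IO"); return 1; }
    if (hdr.status != 0)
        fprintf(stderr, "SYNCHRONIZE CACHE failed, SCSI status %d\n", (int) hdr.status);

    close(fd);
    return 0;
}

For SATA devices the SCSI layer translates this into the drive's cache-flush command, which targets the on-drive write cache rather than the kernel's page cache.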

stark

In your script:

  • Optional: Run /bin/sync to flush changes in the page cache to storage.

  • Unmount the target file system (umount /mountpoint), or remount it read-only.

    If the target file system includes root (/) and/or system binaries or libraries (/usr), you cannot unmount the filesystem. In that case, remount the target file system read-only (mount -o remount,ro /mountpoint).

  • Run shutdown -h now to power down the system.

This is the standard sequence that ensures the filesystems are in a clean state at shutdown, and that all changes hit the storage media.

Glärbo