13

Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).

Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reason I don't want to rely on fsync() is that most hard disks lie to you about doing an fsync.

So what I'm looking for is the typical maximum delay after which a file system, operating system (for example Windows), and hard drive will have written data to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.

I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.

What I found so far: 30 seconds is / was the default for `/proc/sys/vm/dirty_expire_centisecs` on Linux. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file systems / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
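
For reference, here is a minimal sketch (Linux-only, and it only covers the kernel's page cache, not the drive's own write cache; the derived "upper bound" is my own rough reading of these tunables, not a guarantee) that reads the relevant writeback settings:

```c
/* Sketch: read the Linux writeback tunables that bound how long dirty
 * pages may stay in the page cache before the kernel starts writing them. */
#include <stdio.h>

static long read_centisecs(const char *path) {
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void) {
    /* dirty_expire_centisecs: age at which a dirty page becomes eligible
     * for writeback; dirty_writeback_centisecs: flusher wake-up interval. */
    long expire = read_centisecs("/proc/sys/vm/dirty_expire_centisecs");
    long wakeup = read_centisecs("/proc/sys/vm/dirty_writeback_centisecs");
    if (expire < 0 || wakeup < 0) {
        fprintf(stderr, "could not read writeback tunables\n");
        return 1;
    }
    /* Rough upper bound before the kernel even starts the write:
     * expire + one wake-up interval; the disk's cache delay comes on top. */
    printf("kernel writeback upper bound: ~%.1f s\n", (expire + wakeup) / 100.0);
    return 0;
}
```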

Steve McLeod
Thomas Mueller
  • Just a question: Why would you want to write data that you're throwing away anyway? – Nikos C. Nov 30 '12 at 17:35
  • 1
    You may also want to use the `sync(2)` or `syncfs(2)` syscall after your `fsync(2)` syscall, and you might want also to use `sync_file_range(2)` -with caution. – Basile Starynkevitch Nov 30 '12 at 17:37
  • Note that the "default" value, either defined by the OS kernel, or by distribution packagers via startup scripts, is not guaranteed. There are a number of valid reasons that specific systems may have completely different values because the administrators have tuned them that way. It might be better to figure out how to approach your problem without having to make assumptions like this... – twalberg Nov 30 '12 at 17:42
  • 1
    @NikosC. I am writing a library, which has no control over what methods the application calls. – Thomas Mueller Nov 30 '12 at 18:09
  • @BasileStarynkevitch did you read the article "most hard disks lie to you about doing an fsync" in the question? fsync and related methods are not a reliable solution, I don't want to depend on them. Also, they are not available on all operating systems. – Thomas Mueller Nov 30 '12 at 18:10
  • @twalberg I will make the "retention time" value configurable, but I am asking here to get some answers to what are common values (30 seconds, 2 minutes,...). – Thomas Mueller Nov 30 '12 at 18:12
  • You might set the default to 1 minute, but focus the user's attention on that configurable value.... – Basile Starynkevitch Nov 30 '12 at 19:12
  • FreeBSD UFS writes to disk every 30 seconds, but then you need to take into account the time data stays in the disk cache, too. – Good Person Dec 03 '12 at 16:15

5 Answers

3

Let me restate your problem in only slightly uncharitable terms: you're trying to control the behavior of a physical device that even its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document it accordingly.

You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
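
As a rough illustration of the unbuffered read-back idea, here is a Linux-specific sketch (`verify_block` is just an illustrative name; note that O_DIRECT only bypasses the kernel's page cache, so the drive's own cache may still satisfy the read):

```c
/* Sketch: read a block back with O_DIRECT so the kernel page cache is
 * bypassed, then compare it to what was written. This is a plausibility
 * check, not a guarantee: the drive's internal cache is still in play. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* offset and len must be multiples of the logical block size (4096 assumed). */
int verify_block(const char *path, off_t offset, const void *expected, size_t len) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }

    ssize_t n = pread(fd, buf, len, offset);
    int ok = (n == (ssize_t)len) && memcmp(buf, expected, len) == 0;

    free(buf);
    close(fd);
    return ok ? 1 : 0;
}
```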

eh9
  • 1
    I'm not trying to actively control the behavior, I just want to know what the typical behavior is. Specifically, I would like to know after how many seconds, at most, the data ends up on the disk, without using fsync. It's a smart idea to try to read data from disk without using a buffer, but unfortunately there is no cross-platform way to achieve that. – Thomas Mueller Dec 03 '12 at 05:18
  • When you say "I would like to ensure the old data is stored on disk", that sounds like a hard guarantee, which means you need to control certain aspects of the disk write, or, more verbosely, to restrict the behavior of the disk away from certain forbidden actions. If you don't (or can't) control it, you can't offer a hard guarantee. This is what I meant by documenting it appropriately. – eh9 Dec 04 '12 at 00:26
  • 1
    I know it's probably impossible to get a hard guarantee for all operating systems and hard disks. I just like to understand as much of the facts as possible. – Thomas Mueller Dec 04 '12 at 05:04
  • I would have liked to get more answers, with specific data such as the maximum time Windows will wait until it flushes the buffer. But lacking that information, I think this is the best answer. – Thomas Mueller Dec 09 '12 at 17:53
  • 1
    This was written for Windows 2000 but is still mostly relevant: https://msdn.microsoft.com/en-us/library/bb742613.aspx#ECAA – Daira Hopwood Oct 27 '15 at 10:03
  • 1
    In particular: "An older Cache Manager utility [from SysInternals] documents some of the internal constants [...] used in the lazy write algorithm. These include CcFirstDelay, which delays writes three seconds after their first access; CcIdleDelay, which triggers writes one second into an idle period; and CcCollisionDelay, which triggers a 100-millisecond delay if a speculative lazy write encounters a disk busy condition. As of this writing, it is not certain if these parameters that control Cache Manager operation were carried forward into Windows 2000, but it seems likely they were." – Daira Hopwood Oct 27 '15 at 10:09
  • See also https://msdn.microsoft.com/en-us/library/windows/hardware/ff563944(v=vs.85).aspx for how to query an SCSI-compatible drive's reported caching properties (unfortunately not the timings) on Windows. – Daira Hopwood Oct 27 '15 at 10:20
2

The only way to reliably make sure that data has been synced is to use the OS-specific syncing mechanism, as per PostgreSQL's Reliability Docs:

When the operating system sends a write request to the storage hardware, there is little it can do to make sure the data has arrived at a truly non-volatile storage area. Rather, it is the administrator's responsibility to make certain that all storage components ensure data integrity.

So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
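
As an illustration, such a wrapper might look roughly like this (the function name is made up; `FlushFileBuffers`, `fsync` and macOS's `F_FULLFSYNC` are the documented primitives, but as discussed, the drive may still acknowledge before the data is physically written):

```c
/* Sketch of a portable "flush to storage" wrapper. */
#ifdef _WIN32
#include <windows.h>
#include <io.h>

int flush_to_storage(int fd) {
    HANDLE h = (HANDLE)_get_osfhandle(fd);
    return FlushFileBuffers(h) ? 0 : -1;
}
#else
#include <fcntl.h>
#include <unistd.h>

int flush_to_storage(int fd) {
#ifdef F_FULLFSYNC
    /* On macOS, plain fsync() does not ask the drive to flush its cache;
     * F_FULLFSYNC does. */
    if (fcntl(fd, F_FULLFSYNC) == 0) return 0;
#endif
    return fsync(fd);
}
#endif
```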

udoprog
  • I'm looking for answers of the type "data is written to disk after at most 30 seconds". This link does not answer this question. – Thomas Mueller Dec 03 '12 at 05:17
  • 1
    Without flushing and configuration, the answer is that it can be arbitrarily long, since in theory the disk buffer could retain a write request forever pending other requests that flush it with write acceleration. – udoprog Dec 03 '12 at 07:23
  • 1
    do you have a link to back this up? – Thomas Mueller Dec 03 '12 at 08:44
  • 1
    According to the documentation I found (see the link above, "Writing Dirty Pages to Disk"), for Linux, you are wrong, and dirty pages are "flushed (written) to disk ... Too much time has elapsed since a page has stayed dirty", and the setting is `dirty_expire_centisecs`. But if you have a link that says data might never be written, please post it! Also, I'm looking for information on how things work for other operating systems (Windows) and file systems (especially SSDs). – Thomas Mueller Dec 03 '12 at 08:58
  • I am simply referring to the case when the disk write cache is enabled and write ops are buffered instead of applied. There is no operating-system-level facility that works across all disks to assert that this has been flushed; hence the "arbitrary" part, in that it is up to the disk manufacturer to decide on the flush strategy. Also note that this is unrelated to the OS page cache. – udoprog Dec 03 '12 at 16:31
  • 1
    I believe that disks write the data within a few seconds. If you are saying that this can take a few minutes for some disks, then it would be nice if you could provide a link. I would be really interested in how it actually works, versus guessing... – Thomas Mueller Dec 03 '12 at 18:22
  • Your question is stated as "at most"; I am simply trying to point out that it is dependent on implementation details of the hard drive in question. I do not have a source for implementation specifics; you'd probably have to work for the companies manufacturing the disks to get this. Less than a few seconds is probably a safe bet, but it doesn't guarantee that the data has been flushed. – udoprog Dec 03 '12 at 19:19
  • OK, thanks. I also don't know how it works in reality, that's why I asked, hoping that somebody does know :-) – Thomas Mueller Dec 03 '12 at 19:31
2

First of all, thanks for the information that hard disks lie about flushing data; that was new to me.

Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.

Your only solution is to use a fuzzy logic timer to estimate when the data will be written.

In my opinion this is the wrong way. You have control over when the OS writes to the hard drive, so use that possibility and control it! Then only the lying hard drive is your problem. That problem can't be solved reliably. I think you should tell the user/admin that they must take care when choosing the right hard drive. Of course, it might be a good idea to implement the additional timer you proposed.
I believe it's up to you to run a series of tests with different hard drives and Brad Fitzpatrick's tool to get a good estimate of when hard drives will have written all data. But of course, if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
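
To illustrate the timer idea, a minimal sketch (the retention period and structure names are made up, and the actual value would have to come from tests like the ones described above):

```c
/* Sketch: a superseded block is only handed back to the allocator once a
 * configurable grace period has passed, giving the OS and the drive time
 * to flush the data that replaced it. */
#include <stdbool.h>
#include <time.h>

#define RETENTION_SECONDS 60  /* configurable; chosen conservatively */

struct retired_block {
    long   block_nr;     /* block that now holds only stale data */
    time_t retired_at;   /* when the replacement write was issued */
};

/* A retired block may be reused only after the grace period has elapsed. */
bool may_reuse(const struct retired_block *b, time_t now) {
    return difftime(now, b->retired_at) >= RETENTION_SECONDS;
}
```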

Werner Henze
  • Yes, testing this is the way to go, I will do that. I know this can't be solved reliably, but I would like to get more information about known delays, such as the `dirty_expire_centisecs` setting in Linux. I will try to re-phrase the question. – Thomas Mueller Dec 07 '12 at 18:11
1

There are a lot of caches involved in giving users a responsive system.

There is the CPU cache, the kernel/file-system memory cache, the disk drive's memory cache, etc. What you are asking is how long it takes to flush all the caches.

Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.

Disk drives do go bad eventually. The solution you are looking for is how to build a redundant CPU/disk drive system such that the system survives a component failure and still keeps working.

You could improve the likelihood that the system will keep working with the aid of hardware such as RAID arrays and other high-availability configurations.

As far as a software solution goes, I think the answer is to trust the OS to do the optimal thing. Most of them flush buffers out routinely.

Arun Taylor
0

This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least once every second", based on this:

To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.

To be clear, the above says the lazy writer is spawned every second, which is not the same as writing out all data every second, but it's the best I can find so far in my own search for an answer to a similar question. (In my case, I have an Android app which lazy-writes data back to disk, and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps. It may hurt performance, but losing data kills performance a whole lot more if you consider the hours it takes to recover it.)

Eric Mutta