
I have lots of data which I would like to save to disk in binary form and I would like to get as close to having ACID properties as possible. Since I have lots of data and cannot keep it all in memory, I understand I have two basic approaches:

  1. Have lots of small files (e.g. write to disk every minute or so) - in case of a crash I lose only the last file. Performance will be worse, however.
  2. Have a large file (e.g. open, modify, close) - best sequential read performance afterwards, but in case of a crash I can end up with a corrupted file.

So my question is specifically:

If I choose to go for the large file option and open it as a memory mapped file (or using Stream.Position and Stream.Write), and there is a loss of power, are there any guarantees to what could possibly happen with the file?

  1. Is it possible to lose the entire large file, or just end up with the data corrupted in the middle?

  2. Does NTFS ensure that a block of certain size (4k?) always gets written entirely?

  3. Is the outcome better/worse on Unix/ext4?

I would like to avoid using NTFS TxF since Microsoft already mentioned it's planning to retire it. I am using C# but the language probably doesn't matter.

(additional clarification)

It seems that there should be a certain guarantee, because -- unless I am wrong -- if it were possible to lose the entire file (or suffer really weird corruption) while writing to it, then no existing DB would be ACID, unless they 1) use TxF or 2) make a copy of the entire file before writing? I don't think a journal will help you if you lose parts of the file you didn't even plan to touch.

Lou
  • [This thread](https://stackoverflow.com/q/1154446/1488067) has some interesting information on append operations, which is the main type of operation. – Lou Jan 15 '19 at 23:07
  • Existing DBs achieve durability by virtue of three things: 1) write all modifications to a transaction log file first before confirming the operation, 2) have an OS that allows unbuffered writing to storage and 3) have hardware that guarantees that if a write is successful, it will be written to physical storage, even if the power should fail right after confirming the write. Consistency requires a little more work, but also leans heavily on this. The file system is mostly irrelevant in this story since metadata operations (new files, growing files) are rare. – Jeroen Mostert Jan 18 '19 at 13:55
  • The biggest enemy to ACID, incidentally, is commercial grade hardware that doesn't provide critical guarantee number 3: guarantee that a write has really been written. To improve the numbers in benchmarks, many drives/controllers will simply lie when asked to write something to disk *for real*, and return as soon as data has been stored in a (non-battery backed) cache. With such a setup the software has little chance of ensuring durability -- better hope the power never fails, or your data's not that important. This is what you pay a premium for in battery backed RAID controllers for servers. – Jeroen Mostert Jan 18 '19 at 14:04

2 Answers


You can call FlushViewOfFile, which initiates the writing of dirty pages, and then FlushFileBuffers, which, according to this article, guarantees that the pages have been written.
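
In C#, a minimal sketch of that two-step flush (the file name and the 1 MiB capacity are made up for illustration) would use `MemoryMappedViewAccessor.Flush()`, which corresponds to FlushViewOfFile, followed by `FileStream.Flush(true)`, which corresponds to FlushFileBuffers:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

class FlushExample
{
    static void Main()
    {
        const long capacity = 1 << 20; // 1 MiB view, arbitrary for the example

        using (var fs = new FileStream("data.bin", FileMode.OpenOrCreate,
                                       FileAccess.ReadWrite, FileShare.None))
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   fs, null, capacity, MemoryMappedFileAccess.ReadWrite,
                   HandleInheritability.None, leaveOpen: true))
        using (var view = mmf.CreateViewAccessor(0, capacity))
        {
            view.Write(0, 42L);            // modify the mapped region
            view.Flush();                  // FlushViewOfFile: start writing dirty pages
            fs.Flush(flushToDisk: true);   // FlushFileBuffers: block until they reach the disk
        }
    }
}
```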

Calling FlushFileBuffers after each write might be "safer" but it's not recommended. You have to know how much loss you can tolerate. There are patterns that limit that potential loss, and even the best databases can suffer a write failure. You just have to come back to life with the least possible loss, which typically demands some logging with a multi-phase commit.
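
As a rough illustration of that logging idea (the file names and record format here are hypothetical, not something this answer prescribes), the essential ordering is to make the intended change durable in a log before touching the data file:

```csharp
using System;
using System.IO;
using System.Text;

static class WalSketch
{
    // Append the intended change to a log and force it to disk *before*
    // applying it to the data file. If power fails mid-way, the log tells
    // you on restart which change to replay (or to ignore, if incomplete).
    public static void Commit(string logPath, string dataPath, long offset, byte[] payload)
    {
        using (var log = new FileStream(logPath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(log, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(offset);
            writer.Write(payload.Length);
            writer.Write(payload);
            writer.Flush();
            log.Flush(flushToDisk: true);   // phase 1: the intent is durable
        }

        using (var data = new FileStream(dataPath, FileMode.OpenOrCreate, FileAccess.Write))
        {
            data.Position = offset;
            data.Write(payload, 0, payload.Length);
            data.Flush(flushToDisk: true);  // phase 2: the change itself is durable
        }
        // A real implementation would now mark the log record as applied
        // and eventually truncate the log.
    }
}
```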

I suppose it's possible to open the memory-mapped file with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, but that's going to hurt your throughput. I don't do this. I open the memory-mapped files for asynchronous I/O, letting the OS optimize the throughput with its own implementation of async I/O completion ports. It's the fastest possible throughput. I can tolerate potential loss, and have mitigated appropriately. My memory-mapped data is file backup data... and if I detect loss, I can re-back up the lost data once the hardware error is cleared.
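
For reference, here is roughly what those flags look like from C#. This is a sketch only: `FileOptions` has no named member for FILE_FLAG_NO_BUFFERING, so its Win32 value is cast in, and the 4096-byte sector size is an assumption:

```csharp
using System;
using System.IO;

class UnbufferedWriteSketch
{
    // FILE_FLAG_NO_BUFFERING has no FileOptions member; 0x20000000 is its Win32 value.
    const FileOptions NoBuffering = (FileOptions)0x20000000;

    static void Main()
    {
        // With no buffering, writes must be aligned to and sized in multiples
        // of the volume's sector size; 4096 bytes is assumed here.
        var sector = new byte[4096];

        using (var fs = new FileStream("unbuffered.bin", FileMode.Create,
                                       FileAccess.Write, FileShare.None, 4096,
                                       FileOptions.WriteThrough | NoBuffering))
        {
            fs.Write(sector, 0, sector.Length);
            fs.Flush(true);   // FlushFileBuffers as well, for good measure
        }
    }
}
```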

Obviously, the file system has to be reliable enough to operate a database application, but I don't know of any vendors that suggest you don't still need backups. Bad things will happen. Plan for loss. One thing I do is never write into the middle of data. My data is immutable and versioned, and each "data" file is limited to 2 GB, but each application employs different strategies.
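
A sketch of that append-only, versioned layout (the 2 GB cap comes from the text above; the file naming and record handling are made up for illustration):

```csharp
using System;
using System.IO;

class RollingAppendWriter
{
    const long MaxFileSize = 2L * 1024 * 1024 * 1024;  // 2 GB cap per data file
    readonly string directory;
    int version;

    public RollingAppendWriter(string directory)
    {
        this.directory = directory;
        Directory.CreateDirectory(directory);
    }

    // Records are only ever appended; once a file reaches the cap,
    // a new versioned file is started and the old one is never touched again.
    public void Append(byte[] record)
    {
        string path = Path.Combine(directory, $"data.{version:D6}.bin");
        if (File.Exists(path) && new FileInfo(path).Length + record.Length > MaxFileSize)
        {
            version++;
            path = Path.Combine(directory, $"data.{version:D6}.bin");
        }

        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        {
            fs.Write(record, 0, record.Length);
            fs.Flush(flushToDisk: true);   // how often to flush is a loss-tolerance decision
        }
    }
}
```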

Clay

The NTFS file system (and ext3/ext4) uses a transaction journal to apply changes. Each change is recorded in the journal first, and then the journal itself is used to actually perform the change. Except for catastrophic disk failures, the file system is designed to be consistent in its own data structures, not yours: in case of a crash, the recovery procedure decides what to roll back in order to preserve consistency. If a rollback happens, your "not-yet-written but to-be-written" data is lost. The file system will be consistent, but your data will not be.

Additionally, there are several other factors involved: software and hardware caches introduce an additional layer, and therefore a point of failure. Usually the operations are performed in the cache, and then the cache itself is flushed to disk. The file system driver won't see the operations performed "in" the cache, only the flush operations. This is done for performance reasons, as the hard drive is the bottleneck. Some hardware controllers have batteries to guarantee that their own cache can be flushed even in the event of a power loss.

The size of a sector is another important factor, but you should not rely on it, as the hard drive itself could lie about its native sector size for interoperability purposes.

If you have a memory-mapped file and you insert data in the middle while the power goes down, the content of the file might only partially contain the change you made if it exceeds the size of the internal buffers.

TxF is a way to mitigate the issue, but it has several implications that limit the contexts where you can use it: for example, it does not work across different drives or network shares.

In order to be ACID, you need to design your data structures and/or the way you use them so that you do not rely on implementation details. For example, Mercurial (a versioning tool) always appends its own data to its own revision log. There are many possible patterns; however, the more guarantees you need, the more technology-specific you'll get (and be tied to).
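
One common way to implement such an append-only pattern (a hedged sketch, not Mercurial's actual format) is to prefix each appended record with its length and a checksum, so that a torn final record left by a power loss can be detected and discarded on recovery:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class ChecksummedLog
{
    // Each record: [4-byte length][payload][16-byte MD5 of payload].
    // On recovery, a record whose checksum does not match (because power
    // failed mid-append) is truncated away; all earlier records stay intact.
    public static void Append(string path, byte[] payload)
    {
        using (var md5 = MD5.Create())
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var w = new BinaryWriter(fs))
        {
            w.Write(payload.Length);
            w.Write(payload);
            w.Write(md5.ComputeHash(payload));
            w.Flush();
            fs.Flush(flushToDisk: true);
        }
    }
}
```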

Yennefer
  • Note that ext4 does *not* journal the data unless specifically requested via "`-o data=journal`", because this would cut the disk IO performance in half (due to writing the data first to the journal and then to the filesystem). It will normally only journal metadata to ensure that the filesystem is not corrupted, and it is up to the application to ensure that what it writes to the file is consistent. – LustreOne Feb 06 '19 at 21:41
  • You are right, I forgot to mention that the default is `data=ordered`, not `data=journal`. The fact that ordered is the default is consistent with the answer, as the file system heals itself and not your data. Thank you for pointing it out. – Yennefer Feb 06 '19 at 21:55