
My database engine writes 64-byte records by issuing a write() syscall for the entire disk block. The device is opened with O_DIRECT. For example, the third record within a block starts at byte 128 and ends at byte 192; when I do an UPDATE, the entire disk block (512 bytes by default) is written.

My question is: can I claim ACID compliance if I am writing the record over itself every time an UPDATE occurs? Usually database engines do this in two steps, by writing the modified disk block to another (free) location and then updating an index to point at the new block with one (atomic) write immediately after the first write returns success. But I am not doing this; I am overwriting the current data with the new data and expecting the write to be successful. Does my method have any potential problems? Is it ACID compliant? What if the hardware writes only half of the block and my record is exactly in the middle? Or does the hardware already do the two-step write process I described, but at the block level, so I don't need to repeat it in software?

(Note: no record is larger than a physical disk block (512 bytes by default), an fsync() follows each write(), and this is Linux only.)
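
To make the setup concrete, here is a minimal sketch of this kind of update path; the device path, record layout, and function name are placeholders for illustration, not the actual code:

```c
/* Minimal sketch of an overwrite-in-place UPDATE of one 64-byte record
 * inside a 512-byte block, using O_DIRECT plus fsync() as described above.
 * The device path, record layout, and function name are illustrative only. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE  512
#define RECORD_SIZE 64

int update_record(int fd, off_t block_no, int slot, const void *rec)
{
    void *buf;

    /* O_DIRECT requires the buffer, offset, and length to be block-aligned. */
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return -1;

    off_t off = block_no * BLOCK_SIZE;

    /* Read the whole block, patch one 64-byte record, write the block back. */
    if (pread(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
        goto fail;
    memcpy((char *)buf + slot * RECORD_SIZE, rec, RECORD_SIZE);
    if (pwrite(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
        goto fail;
    if (fsync(fd) != 0)      /* fsync after each write(), as in the question */
        goto fail;

    free(buf);
    return 0;

fail:
    free(buf);
    return -1;
}

/* The caller would open the device with something like:
 *   int fd = open("/dev/sdX", O_RDWR | O_DIRECT);   (O_DIRECT needs _GNU_SOURCE)
 */
```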

Nulik
  • possible duplicate of [Are disk sector writes atomic?](http://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic) – Nemo Mar 16 '12 at 01:56
  • http://www.qnx.com/developers/docs/6.4.0/neutrino/sys_arch/fsys.html#QNX6_filesystem – Zan Lynx Apr 03 '12 at 16:39

3 Answers

1

ACID anticipates failures, and suggests ways to deal with them. Two-phase commits and three-phase commits are two fairly common and well-understood approaches.

Although I'm a database guy, the dbms frees me from having to think about this kind of thing very much. But I'd say overwriting a record without taking any other precautions is liable to fail the "C" and "D" properties ("consistent" and "durable").

To build really good code, imagine that your dbms server has no battery-backed cache, only one power supply, and that during a transaction there's a catastrophic failure in that one power supply. If your dbms can cope with that kind of failure fairly cleanly, I think you can call it ACID compliant.
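
As a rough illustration of the contrast with overwriting in place, here is a deliberately simplified redo-journal sketch of the "make the new data durable somewhere else first, then apply it" idea. It is not a full two-phase commit protocol, and the journal format and function name are invented for illustration:

```c
/* Simplified redo-journal sketch: make the new block image durable in a
 * journal first, then overwrite the data file in place. Invented format,
 * for illustration only. */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 512

struct journal_rec {
    off_t offset;               /* where the block belongs in the data file */
    char  block[BLOCK_SIZE];    /* the new block image */
};

int journaled_update(int data_fd, int journal_fd,
                     off_t offset, const char block[BLOCK_SIZE])
{
    struct journal_rec rec = { .offset = offset };
    memcpy(rec.block, block, BLOCK_SIZE);

    /* Phase 1: make the new block image durable somewhere else first. */
    if (write(journal_fd, &rec, sizeof rec) != (ssize_t)sizeof rec)
        return -1;
    if (fsync(journal_fd) != 0)
        return -1;

    /* Phase 2: overwrite in place; a torn write here can be repaired
     * at restart by replaying the journal entry. */
    if (pwrite(data_fd, block, BLOCK_SIZE, offset) != BLOCK_SIZE)
        return -1;
    if (fsync(data_fd) != 0)
        return -1;

    /* Mark the journal entry as applied (simplified: empty the journal). */
    return ftruncate(journal_fd, 0) == 0 ? 0 : -1;
}
```

On restart, recovery would replay any complete journal entry into the data file before resuming, so an interrupted in-place overwrite can always be repaired from the surviving copy.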

Later . . .

I read Tweedie's transcript. He's not talking about database direct disk access; he's talking about a journaling filesystem. A journaling filesystem also does a two-phase commit.

It sounds like you're trying to reach ACID compliance (in the database sense) with a single-phase commit. I don't think you can get away with that.

Opening with O_DIRECT means "Try to minimize cache effects of the I/O to and from this file" (emphasis added). I think you'll also need O_SYNC. (But the linked kernel docs caution that most Linux filesystems don't implement POSIX semantics of O_SYNC. And both filesystems and disks have been known to lie about whether a write has hit a platter.)
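
As a small sketch of that flag combination (the helper name is mine, and the caveats above about filesystems lying still apply):

```c
/* Sketch only: open for direct, synchronous I/O on Linux.
 * O_DIRECT is Linux-specific and needs _GNU_SOURCE; given the caveats in
 * the kernel docs, an explicit fsync()/fdatasync() after each write is
 * still the safer belt-and-braces approach. */
#define _GNU_SOURCE
#include <fcntl.h>

int open_for_direct_io(const char *path)
{
    return open(path, O_RDWR | O_DIRECT | O_SYNC);
}
```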

There are two more cautions in the kernel docs. First, "It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default." You're not doing that. You're trying to use it to achieve ACID compliance.

Second,

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." -- Linus

SQLite has a readable paper on how they handle atomic commits: *Atomic Commit in SQLite*.

Mike Sherrill 'Cat Recall'
  • I thought about what you say, but even in a two-phase commit, when you start writing the request to the transaction logfile, how do you ensure the write into the logfile was completed? For example, your transaction is 64 bytes, you send a write() to disk, then the power goes down. The disk has written 33 bytes. When the power comes back and you read that sector, how do you know whether the remaining 31 bytes were written or not? – Nulik Mar 15 '12 at 01:59
  • @Nulik: If there's a failure, the restart procedure will look for a decision record in the coordinator's log. If it finds one, processing resumes where it left off. If not, it assumes a rollback. Date has several pages on this detailed topic in his *Introduction to Database Systems*. You could also read the [PostgreSQL source code](http://doxygen.postgresql.org/). – Mike Sherrill 'Cat Recall' Mar 15 '12 at 02:14
  • Well, I actually found the answer to the question. Disk sector writes appear to be atomic, as answered here: http://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic, so since I am writing single sectors I shouldn't worry about that, unless I have some very old hardware – Nulik Mar 15 '12 at 02:51
1

No.

You cannot assume the disk write will be successful. And you cannot assume that the disk will leave the existing data in place. Here is some QNX documentation also stating this.

If you get really, really unlucky, the power will fail while the disk is writing, leaving the block with corrupt checksums and half-written data.

This is why ACID systems use at least two copies of the data.

Zan Lynx
  • According to [an earlier SO answer](http://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic), you are wrong, at least for some (all?) disks... which guarantee atomic single-sector writes even if the power fails. – Nemo Mar 16 '12 at 01:57
  • @Nemo: I know from personal experience that hard drives come up with bad sectors after a power failure event. Perhaps server drives guarantee it. Western Digital Green drives do not. – Zan Lynx Mar 16 '12 at 08:23
0

is write() with O_DIRECT ACID compliant?

No, this is not guaranteed in the general case. Here are some counterexamples for Durability:

  • O_DIRECT makes no guarantees that acknowledged data made it out of a volatile cache that is part of the device
  • O_DIRECT makes no guarantees about persistence of filesystem metadata that might be required to actually read back the (acknowledged) write data (e.g. in the case of appending writes); a sketch of a common mitigation follows this list
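
For the metadata case in the second point, a common mitigation (sketched below with an invented helper name, under the assumption that the data lives in a regular file rather than a raw device) is to fsync() the file and then fsync() its containing directory:

```c
/* Sketch: flush both the file's data and the directory entry that is
 * needed to find the file again after a crash. Illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int flush_file_and_dir(int fd, const char *dir_path)
{
    /* Flush file data and the file's own metadata (e.g. its new size). */
    if (fsync(fd) != 0)
        return -1;

    /* Flush the directory so the (possibly new) entry itself is durable. */
    int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```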

My question is, can I calim [sic] ACID compliance if I am writing the record over itself every time UPDATE occurs?

In the general case, no. For example, a spec-compliant SCSI disk doesn't have to guarantee that you get only the old data or only the new data if a crash happens mid-write (it's legal for it to return an error when reading that region until it is unconditionally overwritten). If you're writing to a file in a filesystem, things are even more complicated. A successful fsync() after the write(), before you issue new I/O, tells you the write is stable, but it is not enough to ensure Atomicity (only old or new data) in the general case of an awkwardly timed power loss.

Does my method [assuming overwrites are perfectly atomic] has [sic] any potential problems?

Yes, see above. What you are doing may work as you wish in certain setups, but there's no guarantee it will work in all of them (even though they are "non-faulty" per their spec).

See this answer on "What does O_DIRECT really mean?" for further discussion.

Anon