
I'm trying to parse some code which works with O_DIRECT files.

ssize_t written = write(fd, buf, size);

What confuses me is that size can be smaller than the sector size of the disk. In that case, does write(fd, buf, size) write the entirety of buf to the disk, or only the first size bytes of buf?

Without O_DIRECT it is simply the second case, but I can't find any documentation about what happens with O_DIRECT. From what I've read it will still send buf to the disk, so the only thing I can think of is that it also tells the disk to write only size bytes...

Cjen1
  • "write the entirety of `buf`" --> how big is `buf`, a sector size? – chux - Reinstate Monica Jul 30 '20 at 19:00
  • @stark Ideally, but IME Linux file systems don't reliably handle improperly-sized (or non-page-aligned) I/O requests to a file opened with `O_DIRECT`, with little to no documentation as to what "improperly-sized" even is. See the `O_DIRECT` paragraphs in [the **NOTES** section of the Linux `open(2)` man page](https://man7.org/linux/man-pages/man2/open.2.html#NOTES). – Andrew Henle Jul 30 '20 at 19:07
  • @AndrewHenle I deleted my comment. Pretty sure the RMW I saw was in the application. An unaligned write will always return EINVAL. – stark Jul 30 '20 at 19:57
  • @stark IME unaligned buffers have also resulted in EINVAL, but the `open(2)` man page does say that size and alignment restrictions "might be absent entirely". Most common hardware today should be able to handle direct IO with unaligned buffers of any size, but I'd be leery of unaligned direct IO on lower-end hardware, especially disk controllers/HBAs. [Even expensive high-end hardware can have bugs in such situations.](https://github.com/illumos/illumos-gate/blob/4e0c5eff9af325c80994e9527b7cb8b3a1ffd1d4/usr/src/cmd/fs.d/ufs/mkfs/mkfs.c#L430). Given the crappy disk controllers out there... – Andrew Henle Jul 30 '20 at 20:40
  • Reading or writing a disk unaligned would never work. The smallest addressable unit is a sector. The question was about O_DIRECT for file access, where the implementation is up to the filesystem code. Note that files do not have to be sector-aligned on disk, although they frequently are, so the filesystem likely already has code for buffering and fixing alignment. – stark Jul 30 '20 at 21:25
  • `write(fd, buf, size)` will only access `buf[0]` to `buf[size-1]`, inclusive, except if there is a bug in the file system or block device driver. It is also perfectly normal for it to only access/write a smaller portion of the buffer, only some initial part, and return the number of bytes written (*"short count"*). (Also things like `EINTR` may occur, if the process has signals delivered to handlers installed without `SA_RESTART`.) – None Jul 30 '20 at 23:12
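
The short-count and `EINTR` behaviour described in the last comment is usually handled with a retry loop. The following is only a minimal sketch: `write_all` is a hypothetical helper name, not something from the code in the question, and it assumes a blocking file descriptor. With `O_DIRECT`, note that after a short write the remaining offset and length may no longer satisfy whatever alignment the filesystem expects, so the retry itself could fail.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from the original code): keep calling write()
 * until all `size` bytes have been accepted or a real error occurs.
 * Returns 0 on success, -1 on error with errno set. */
static int write_all(int fd, const void *buf, size_t size)
{
    const char *p = buf;
    size_t left = size;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written: retry */
            return -1;          /* real error, errno set by write() */
        }
        if (n == 0) {
            errno = EIO;        /* defensive: avoid spinning if no progress is made */
            return -1;
        }
        p += n;                 /* short count: skip what was already written */
        left -= (size_t)n;
    }
    return 0;
}
```

Each call only ever hands `write()` the bytes from `p` up to `p + left - 1`, so nothing past `buf[size-1]` is read.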

1 Answer


[...] does write(fd,buf,size) write the entirety of buf to fd or only the first size bytes of buf to disk?

If the write() call succeeds and returns size, then all size bytes you asked for have been written, and nothing beyond them. The question then becomes: written to where? Remember that opening a file with O_DIRECT is more of a hint that you want to bypass OS caches than an order. The filesystem could choose to send your I/O through the page cache anyway, either because that's what it always does or because you broke the alignment rules and falling back to the page cache is a way of quietly fixing up your mistake. The only way to know which happened is to investigate the data path the I/O took when it was issued.
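
If you would rather not depend on the filesystem fixing things up, one common approach is to make the request obey the usual alignment rules yourself. The sketch below is an illustration under assumptions, not what the code in the question does: `round_up` and `write_padded` are hypothetical names, and `align` stands for whatever block size your device and filesystem expect (often 512 or 4096, but that is for you to determine). It copies the data into an aligned, zero-padded bounce buffer and writes the padded length.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: copy `size` bytes into a zero-padded bounce buffer
 * whose address and length are multiples of `align`, then issue a single
 * write(). Returns what write() returns, or -1 if allocation fails. */
static ssize_t write_padded(int fd, const void *data, size_t size, size_t align)
{
    size_t padded = round_up(size, align);
    void *bounce = NULL;

    int rc = posix_memalign(&bounce, align, padded);
    if (rc != 0) {
        errno = rc;             /* posix_memalign returns the error code directly */
        return -1;
    }
    memset(bounce, 0, padded);  /* padding bytes are zeros */
    memcpy(bounce, data, size);

    /* Note: this writes `padded` bytes to the file, not `size`. */
    ssize_t n = write(fd, bounce, padded);

    int saved = errno;          /* keep write()'s errno across free() */
    free(bounce);
    errno = saved;
    return n;
}
```

The caller would open `fd` with `O_DIRECT` (with glibc that means defining `_GNU_SOURCE` before including `<fcntl.h>`) and keep the file offset aligned as well. Because the padding ends up in the file, the file grows to the padded length; whether that is acceptable, or needs trimming afterwards, depends on the application.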

Anon