
I'm trying to understand this requirement in POSIX-2017:

Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position.

1) Does "occur" refer to read being called, read returning successfully, or something else?

2) If, while one process is calling read, another process calls write twice on the same file, are there any circumstances where the read will reflect some or all of the second write, but not all of the first?

  |----------------read-----------------|
      |--write1--|       |--write2--|

3) How is this handled by implementations (e.g. ext4)? Is this something worth worrying about?

  • I'm pretty sure that the reads and writes are meant to be atomic with respect to each other. That is, the results are to be as if they're completed in the order that they're started. So in your example, that read would "complete" first, and then write1 would be completed before write2 (assuming that it's time passing left to right in your figure). Of course, what's actually going on on the HDD/SSD can be different to what's going on for the application(s) performing those reads and writes - the filesystem driver might be doing something clever, optimising writes, etc... – bazza Jan 06 '20 at 20:37
  • Thanks for your reply! This interpretation of the standard seems to conflict with the observations reported in [this thread on inter-process read/write atomicity](https://stackoverflow.com/questions/35595685/write2-read2-atomicity-between-processes-in-linux), though. Maybe I'm missing something subtle? – MasonicHedgehog Jan 06 '20 at 23:12
  • No worries, well yes I agree that this is in conflict with that! I think the key phrase is "*can* be serialised", i.e. is permitted to be, may be, but not necessarily guaranteed as such. POSIX is old, and is an amalgamation of differences in C libraries that existed back in the 1980s. It allows for variations in implementation in some areas to avoid making code that existed at the time not POSIX compliant. Basically, "here be Dragons, take care"; it's a hint that a program should use things like blocking reads/writes and semaphores to serialise I/O, if that matters to the program. – bazza Jan 07 '20 at 07:11

1 Answer


To answer your first question:

"Occur" refers to the whole read, from the point of the call to the point of the value being returned. All of it has to happen after the previous write, and before the next write. The same page says so:

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

  • Any subsequent successful write() to the same byte position in the file shall overwrite that file data.

POSIX makes no guarantee about how reads and writes that overlap in time interleave, because such additional guarantees are quite difficult to implement.
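The one case the quoted text does cover, a read() that starts after a write() has returned, can be checked with a small sketch. This is only an illustration: the helper name `read_reflects_completed_write` and the temp-file template are made up here, and it uses two descriptors in one process rather than two processes to keep it short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a read issued after a write has
 * returned sees the written bytes, 0 if it does not, -1 on setup or
 * I/O error. */
int read_reflects_completed_write(void)
{
    char path[] = "/tmp/posix_rw_XXXXXX";   /* illustrative template */
    int wfd = mkstemp(path);                /* writer descriptor */
    if (wfd < 0)
        return -1;

    int rfd = open(path, O_RDONLY);         /* separate open file description */
    unlink(path);                           /* file disappears once both are closed */
    if (rfd < 0) {
        close(wfd);
        return -1;
    }

    const char msg[] = "hello";
    char buf[sizeof msg] = {0};
    int result = -1;

    /* The pread() is only issued once pwrite() has returned successfully,
     * so it is provably "after" the write in the sense of the quote. */
    if (pwrite(wfd, msg, sizeof msg, 0) == (ssize_t)sizeof msg &&
        pread(rfd, buf, sizeof buf, 0) == (ssize_t)sizeof buf)
        result = (memcmp(buf, msg, sizeof msg) == 0);

    close(rfd);
    close(wfd);
    return result;
}
```

A caller could check the result with `assert(read_reflects_completed_write() == 1);`. Nothing here says anything about a read that is still in progress while the write runs, which is the case the question's diagram asks about.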

Regarding the second question:

Again, refer to the above quote. If a process called write() and write() returned successfully, any subsequent read by any process would reflect the written data.

So the answer is "yes, if the first write() failed".

Implementation:

ext4, like almost every other filesystem, uses a page cache. The page cache is an in-memory representation of the file's data (or a relevant part thereof). Any synchronization that needs to be done is done on this representation. In that respect, reading and writing a file is like reading and writing shared memory.

The page cache, as the name suggests, is built from pages. In most implementations, a page is a 4 KiB region of memory, and reads and writes happen on a page basis.

This means that e.g. ext4 will serialize reads and writes on the same 4 KiB region of the file, but a 12 KiB write may not be atomic.
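To make the page arithmetic concrete, here is a minimal sketch that counts how many page-cache pages a single I/O request touches. The function name `pages_touched` is hypothetical, and the page size is passed in as a parameter rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of pages of size page_size covered by the
 * byte range [offset, offset + len). */
size_t pages_touched(size_t offset, size_t len, size_t page_size)
{
    assert(page_size > 0 && len > 0);
    size_t first = offset / page_size;              /* page holding the first byte */
    size_t last = (offset + len - 1) / page_size;   /* page holding the last byte */
    return last - first + 1;
}
```

With 4096-byte pages, `pages_touched(0, 12288, 4096)` is 3, and the same 12 KiB write starting at offset 100 touches 4 pages. Any request that touches more than one page is one that per-page serialization alone would not make atomic. The real page size can be obtained with `sysconf(_SC_PAGESIZE)`.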

AFAICT, ext4 does not allow multiple concurrent writes to the same page, or concurrent reads and writes on the same page, but this is not guaranteed anywhere.

edit: The filesystem (on-disk) block size might be smaller than a page, in which case some I/O may be done at a block-size granularity, but that is even less reliable in terms of atomicity.
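If a program needs multi-page I/O to be serialized and cannot rely on the filesystem for it, one option is to serialize explicitly. The following is a rough sketch using POSIX advisory record locks via `fcntl(F_SETLKW)`; the names `lock_range`, `locked_pwrite` and `locked_pread` are made up for illustration. It only helps between cooperating processes that all use the same wrappers, and because these locks are owned per process, it does not serialize threads within one process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (F_RDLCK / F_WRLCK) or clear (F_UNLCK) an advisory lock on the byte
 * range [off, off + len). Blocks until the lock can be granted. */
static int lock_range(int fd, short type, off_t off, off_t len)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/* pwrite() under an exclusive lock on exactly the range being written. */
ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    assert(len > 0);    /* l_len == 0 would mean "to end of file" */
    if (lock_range(fd, F_WRLCK, off, (off_t)len) == -1)
        return -1;
    ssize_t n = pwrite(fd, buf, len, off);
    int saved = errno;  /* keep pwrite's errno across the unlock */
    lock_range(fd, F_UNLCK, off, (off_t)len);
    errno = saved;
    return n;
}

/* pread() under a shared lock, so it cannot overlap a locked_pwrite()
 * on the same range made by another process. */
ssize_t locked_pread(int fd, void *buf, size_t len, off_t off)
{
    assert(len > 0);
    if (lock_range(fd, F_RDLCK, off, (off_t)len) == -1)
        return -1;
    ssize_t n = pread(fd, buf, len, off);
    int saved = errno;
    lock_range(fd, F_UNLCK, off, (off_t)len);
    errno = saved;
    return n;
}
```

The descriptor must be open for writing to take the write lock and for reading to take the read lock. Short reads and writes and `EINTR` are not handled here; a real program would loop on them.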

root