2

Is the order of page flushes with msync(MS_ASYNC) on linux guaranteed to be the same as the order the pages where written to?

If it depends on circumstances, is there a way for me (full server access) to make sure they are in the same order?

Background

I'm currently using OpenLDAP Symas MDB as a persistent key/value storage and without MDB_MAPASYNC - which results in using msync(MS_ASYNC) (I looked through the source code) - the writes are so slow, that even while processing data a single core is permanently waiting on IO at sometimes < 1MB/s. After analyzing, the problem seems to be many small IO Ops. Using MDB_MAPASYNC I can hit the max rate of my disk easily, but the documentation of MDB states that in that case the database can become corrupted. Unfortunately the code is too complex to me/I currently don't have the time to work through the whole codebase step by step to find out why this would be, and also, I don't need many of the features MDB provides (transactions, cursors, ACID compliance), so I was thinking of writing my own KV Store backed by mmap, using msync(MS_ASYNC) and making sure to write in a way that an un-flushed page would only lose the last touched data, and not corrupt the database or lose any other data.

But for that I'd need an answer to my question, which I totally can't find by googling or going through linux mailing lists unfortunately (I've found a few mails regarding msync patches, but nothing else)

On a note, I've looked through dozens of other available persistent KV stores, and wasn't able to find a better fit for me (fast writes, easy to use, embedded(so no http services or the like), deterministic speed(so no garbage collection or randomly run compression like leveldb), sane space requirements(so no append-only databases), variable key lengths, binary keys and data), but if you know of one which could help me out here, I'd also be very thankful.

griffin
  • 1,261
  • 8
  • 24

2 Answers2

1

msync(MS_ASYNC) doesn't guarantee the ordering of the stores, because the IO elevator algos operating in the background try to maximize efficiency by merging and ordering the writes to maximize the throughput to the device.

brandx
  • 1,053
  • 12
  • 10
1

From man 2 msync:

Since Linux 2.6.19, MS_ASYNC is in fact a no-op, since the kernel properly tracks dirty pages and flushes them to storage as necessary.

Unfortunately, the only mechanism to sync a mapping with its backing storage is the blocking MS_SYNC, which also does not have any ordering guarantees (if you sync a 1 MiB region, the 256 4 KiB pages can propagate to the drive in any order -- all you know is that if msync returns, all of the 1 MiB has been synced).

Travis Gockel
  • 26,877
  • 14
  • 89
  • 116