
In the Zarr tutorial it is written:

Zarr arrays have not been designed for situations where multiple readers and writers are concurrently operating on the same array.

What would happen if that does occur? Will it crash? Undefined behavior? Will it just be slow or inefficient?

EDIT: Multiple writers and multiple readers are supported:

By data source we mean that multiple concurrent read operations may occur. By data sink we mean that multiple concurrent write operations may occur, with each writer updating a different region of the array

Example:

    synchronizer = zarr.ProcessSynchronizer('data/example.sync')
    z = zarr.open_array(..., synchronizer=synchronizer)
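
For illustration, a slightly fuller version of that snippet might look like the sketch below (the store path, shape, chunking and dtype are made-up assumptions, not taken from the docs), with each writer process updating a different row/chunk through the same ProcessSynchronizer:

    import numpy as np
    import zarr
    from multiprocessing import Process

    PATH = 'data/example.zarr'
    SYNC = 'data/example.sync'

    def write_row(row):
        # Every process uses the same ProcessSynchronizer directory, which
        # provides chunk-level file locks shared across processes.
        synchronizer = zarr.ProcessSynchronizer(SYNC)
        z = zarr.open_array(PATH, mode='r+', synchronizer=synchronizer)
        # Each writer updates a different region (here: one row = one chunk).
        z[row, :] = np.full(100, row, dtype='f8')

    if __name__ == '__main__':
        # Create the array up front, then let the workers fill it in.
        zarr.open_array(PATH, mode='w', shape=(4, 100), chunks=(1, 100), dtype='f8')
        procs = [Process(target=write_row, args=(r,)) for r in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(zarr.open_array(PATH, mode='r')[:, 0])   # expected: [0. 1. 2. 3.]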

agemO
  • I would assume it would behave in the same way that a DB system would https://en.wikipedia.org/wiki/Write%E2%80%93write_conflict https://en.wikipedia.org/wiki/Read%E2%80%93write_conflict – E.Serra Jan 10 '19 at 11:54

1 Answer


Going by their own docs, the default behavior is no synchronization.

So, it won't be slow/inefficient - that would happen if you did have sync and workers had to wait for other workers to release resources before proceeding.

It also won't crash, at least without third party interference - nothing's limiting the access, and I'm inferring there are no runtime checks against such a situation that might throw an error by design.

Undefined? Not quite, but we're getting closer. Assuming there are indeed no checks or locks by default, what you'll get is a race condition, i.e. if your Writer gets to the data first, a Reader coming second will simply see whatever the Writer wrote.

If, instead, your Reader gets its grubby little IO on first, it will read the original data before it gets overwritten by the Writer. And if you have two Writers, whichever comes later determines the final state of the data.

Same holds for >2 Readers/Writers; I leave figuring out the specifics of the resulting mess as an exercise for you.
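
To make the "whichever comes later wins" point concrete, here is a minimal sketch (the path, shape and values are made up for illustration) of two unsynchronized writers targeting the same single chunk; whichever process happens to finish last determines what a later read sees:

    import numpy as np
    import zarr
    from multiprocessing import Process

    PATH = 'data/race.zarr'

    def writer(value):
        # Each process opens the array itself; no synchronizer is passed,
        # so nothing coordinates access to the shared chunk.
        z = zarr.open_array(PATH, mode='r+')
        z[:] = value

    if __name__ == '__main__':
        # Single chunk, so both writers collide on the same chunk file.
        zarr.open_array(PATH, mode='w', shape=(100,), chunks=(100,), dtype='i8')
        procs = [Process(target=writer, args=(v,)) for v in (1, 2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Run it a few times and the result can flip between all 1s and all 2s;
        # with more writers and finer chunks you can even get a mix per chunk.
        print(np.unique(zarr.open_array(PATH, mode='r')[:]))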

jkm
  • I'm not sure I understand: if 2 writers are supported and 2 readers are supported, wouldn't the locks also work for n writers + n readers? – agemO Jan 10 '19 at 13:02
  • None of these are 'unsupported', inasmuch as there is nothing stopping you from doing that if you want to. There are no locks in either case - unless you explicitly add your own or use the synchronizer. It's just that the program will behave unpredictably, depending on how fast each worker runs - R1->W1->R2->W2 has a different result than W1->R1->R2->W2, which has a different effect than R1->R2->W1->W2 and so on. Worst case, you wind up with (N!) *different* possible paths through the program, where N is the number of workers. – jkm Jan 10 '19 at 14:02
  • 1
    'unless you explicitly add your own or use the synchronizer' Yes my question is what happen if I use the synchronizer (which is supposed to be made only for multiple writing), and try to read while writing: will the lock apply ? – agemO Jan 10 '19 at 14:35
  • 1
    @agemO I can sort of answer that (better late than never): the synchronizer does not seem to apply when you're reading while a writing operation is going on. There must be strategies around that (see https://stackoverflow.com/questions/75695118/zarr-how-to-avoid-reading-half-written-arrays-spanning-multiple-chunks) for some ideas. # When you do read before a write operation has finished, you risk getting shorter arrays or NaN blocks for any chunks that haven't yet been written. – bluppfisk Mar 10 '23 at 13:52
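
Following up on that last comment: one crude check a reader can make is whether every chunk has actually been written before trusting the read. This is only a sketch under the assumption of zarr v2 (which exposes the nchunks_initialized property on arrays), and the path is illustrative; it catches never-written chunks, not a chunk that is being rewritten at that very moment:

    import zarr

    z = zarr.open_array('data/example.zarr', mode='r')

    # Chunks that were never written come back as the fill_value (e.g. NaN or 0),
    # so an incomplete write can at least be detected by counting stored chunks.
    if z.nchunks_initialized < z.nchunks:
        print(f'only {z.nchunks_initialized}/{z.nchunks} chunks written so far')
    else:
        data = z[:]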