
I am running a process that writes large files across the storage network. If I run it in a simple loop, I get no failures. If I run it with `distributed` and `jobqueue` during off-peak hours, no workers fail. However, when I run the same command during peak hours, workers get killed.

I have ample memory for the task and plenty of workers, so I am not sitting in a queue.

The error logs usually show a series of warnings about exceeding garbage collection limits, followed by a `Worker killed with Signal 9`.

schierkolk

1 Answer


Signal 9 suggests that the process has violated some system limit, not that Dask itself decided the worker should die. Since this only happens under heavy disk I/O at busy times, I agree that the network storage is the likely culprit: e.g., a lot of writes have been buffered but are not being flushed through the relatively low bandwidth.

Dask also uses local storage for temporary files, and "local" might in fact be the network storage. If you have real local disks on the nodes, you should use those; if not, consider turning off disk spilling altogether. See https://docs.dask.org/en/latest/setup/hpc.html#local-storage
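
A minimal sketch of what that configuration could look like (the file path and `temporary-directory` value are assumptions for illustration; the two memory settings are the ones confirmed in the comments below):

```yaml
# ~/.config/dask/dask.yaml on the worker nodes (one of Dask's standard config locations)
temporary-directory: /local/scratch   # point temp files at a real node-local disk (example path)

distributed:
  worker:
    memory:
      target: false   # don't spill to disk
      spill: false    # don't spill to disk
```

With both thresholds disabled, workers keep data in memory and pause under memory pressure instead of writing spill files to the (network) disk.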

mdurant
  • Do I need to increase the memory if I turn off disk spilling? – schierkolk Feb 20 '21 at 17:09
  • *maybe*, or maybe a given worker will stop accepting new tasks until memory has been freed. – mdurant Feb 20 '21 at 17:49
  • That seemed to be the magic: setting `distributed.worker.memory.target: false` and `distributed.worker.memory.spill: false` (don't spill to disk). – schierkolk Mar 02 '21 at 13:21
  • Is there a way to find out the parameters of the job that failed? – schierkolk Mar 02 '21 at 14:03
  • I assume the task is lost when workers die – schierkolk Mar 02 '21 at 15:19
  • They should come back as cancelled tasks, I think - the scheduler will refuse to retry tasks that have been associated with a worker crash more than a certain number of times. – mdurant Mar 02 '21 at 16:12
  • I see how to cancel a job, but not how to retrieve a cancelled job. – schierkolk Mar 03 '21 at 17:18
  • I referred to "tasks"; not sure what you mean by "job". The scheduler is doing the cancelling, not you. Doing `.compute()` or another execute action on a collection, or getting the result of a Future, will give you the latest error, including anything cancelled. – mdurant Mar 03 '21 at 18:44
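
A minimal sketch of retrieving those errors, per the last comment (the scheduler address, task function, and paths are hypothetical placeholders, not from the thread):

```python
from dask.distributed import Client

def write_large_file(path):
    # hypothetical stand-in for the real I/O-heavy workload
    with open(path, "wb") as f:
        f.write(b"\0" * 10_000_000)
    return path

client = Client("tcp://scheduler:8786")             # placeholder scheduler address
paths = [f"/data/out-{i}.bin" for i in range(100)]  # example output paths

futures = client.map(write_large_file, paths)

try:
    results = client.gather(futures)
except Exception as err:
    # For a task the scheduler gave up retrying after repeated worker crashes,
    # this is typically a KilledWorker error naming the task key and the last
    # worker it ran on; other task failures re-raise their original exception.
    print(type(err).__name__, err)
```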