
I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold a lot of memory and the cluster soon fills up.

I have tried `client.restart()` after every task, and `client.cancel(df)`. The first kills the workers and sends a `CancelledError` to the other running tasks, which is troublesome; the second did not help much, because we use a lot of custom objects and functions inside Dask's map functions. Adding `del` for the known variables and calling `gc.collect()` also doesn't help much.

I am fairly sure most of the held memory is due to the custom Python functions and objects called with `client.map(...)`.

My questions are:

  1. Is there a way, from the command line or otherwise, to trigger a worker restart when no tasks are currently running?
  2. If not, what are the possible solutions to this problem? It will be impossible for me to avoid custom objects and pure Python functions inside Dask tasks.
jtlz2
spiralarchitect

1 Answer


If there are no references to the futures, then Dask releases the underlying data, and Python can free the objects you've created with it. See https://www.youtube.com/watch?v=MsnzpzFZAoQ for more information on how to investigate this.
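The mechanism here is ordinary Python reference counting: an object (or a future's result) can only be freed once nothing refers to it, so holding on to futures keeps their results pinned in worker memory. A minimal stdlib sketch of the principle (no Dask required; `BigResult` is an illustrative stand-in for a large task result):

```python
import gc
import weakref

class BigResult:
    """Stands in for a large task result held on a worker."""
    def __init__(self):
        self.payload = bytearray(10**6)  # ~1 MB of data

result = BigResult()           # analogous to keeping a future/result around
tracker = weakref.ref(result)  # observe the object without keeping it alive

assert tracker() is not None   # still referenced -> still in memory

del result                     # drop the last strong reference
gc.collect()                   # reclaim (CPython frees it immediately anyway)

assert tracker() is None       # object is gone; memory released
```

In Dask terms: once you have gathered the results you need, drop the `Future` objects (and any variables or collections that capture them), and the scheduler can release the corresponding memory on the workers.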

If your custom Python code does have a memory leak of its own, then yes, you can ask Dask workers to restart themselves periodically. See the `dask-worker --help` output and look for the options that start with `--lifetime`.
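A sketch of such an invocation (the durations are illustrative, not recommendations):

```shell
# Restart each worker after roughly one hour of uptime.
# --lifetime-stagger spreads the restarts over a window so the
# workers do not all go down at once, and --lifetime-restart asks
# the nanny to bring the worker back up afterwards.
dask-worker scheduler-address:8786 \
    --lifetime 1h \
    --lifetime-stagger 5m \
    --lifetime-restart
```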

jtlz2
MRocklin
  • `--lifetime` did not really help. It restarted the workers, but the running tasks went into indefinite limbo. I had to restart the workers and run again. – spiralarchitect Jan 19 '20 at 12:52
  • 1
    Are the workers simply killed after expiration of the lifetime? It would be useful to have a lifetime option that shuts them down smoothly, e.g. distributing their data and letting them complete running tasks. – malbert Oct 15 '20 at 14:04
  • 1
    @malbert this is what happens. At the end of the lifetime Dask tries to gracefully shutdown a worker. – MRocklin Nov 03 '20 at 19:29
  • @MRocklin that's not what happens: dask actually kills the worker at the end of the lifetime in the middle of whatever task it's running. There's an enhancement request to make it wait until the task has finished: https://github.com/dask/dask-jobqueue/issues/416 – rleelr Nov 02 '21 at 15:25