
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:

In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.

The problem is that I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.

My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Each worker works on one task at a time, potentially using multiple threads and processes.
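
Roughly, my mental model corresponds to something like this minimal local example (LocalCluster and the specific numbers are just for illustration, not my actual SLURM setup):

from dask.distributed import Client, LocalCluster

def task(x):
    return x ** 2

# a pool of 4 workers, each running tasks in 2 threads
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# each submitted task is scheduled onto one of the workers in the pool
futures = [client.submit(task, i) for i in range(10)]
results = client.gather(futures)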

However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most of the cluster configuration (cores, memory, processes) is done on a per-job basis according to the docs.

So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs; I just don't understand and could use another explanation.)


1 Answer


Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters to which users submit jobs). A job consists of instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.

One can submit a job with instructions to launch multiple dask workers (there might be advantages to this due to latency, permissions, resource availability, etc., but that would need to be determined from the particular cluster configuration).

From Dask's perspective, what matters is the number of workers, but from dask-jobqueue's perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total, dask-jobqueue will submit 5 jobs to the SLURM scheduler.

This example, adapted from the docs, will result in 10 dask workers, each with 24 cores:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,          # cores per job
    processes=1,       # workers per job -> 1 worker with all 24 cores
    memory="500 GB"    # memory per job
)
cluster.scale(jobs=10)  # ask for 10 jobs -> 10 workers in total
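
If it helps to see what a job concretely is, dask-jobqueue can show the batch script it generates and submits to SLURM for each job (the exact contents depend on the cluster configuration):

# each job submitted to SLURM runs a script like this one, whose main
# purpose is to launch the dask-worker process(es) for that job
print(cluster.job_script())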

If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores and 50 GB of memory each (note that the cores and memory in the config are per-job totals, split among that job's workers):

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,          # cores per job, split among that job's workers
    processes=10,      # workers per job -> each worker gets 2 cores
    memory="500 GB"    # memory per job -> each worker gets 50 GB
)
cluster.scale(jobs=10)  # ask for 10 jobs -> 10 * 10 = 100 workers
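
To relate the two scaling knobs, scale() can also be given a target number of workers instead of jobs; a rough sketch, continuing the example above where processes=10:

cluster.scale(100)      # ask for 100 workers -> dask-jobqueue submits 10 jobs
cluster.scale(jobs=10)  # ask for 10 jobs -> 10 * 10 = 100 workers
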
  • So a worker is pretty much a single process in a job which is responsible for executing individual tasks? And my understanding, where I said a worker executes a task using whatever processes/threads are required, should be refined to just threads (because a worker is a single process)? – HashBr0wn May 04 '22 at 04:22
  • How does the n_workers configuration work? The docs say it defaults to 0. – HashBr0wn May 04 '22 at 04:24
  • On a separate note does walltime refer to the time to execute a single task or all the tasks in aggregate that are assigned to a worker? – HashBr0wn May 04 '22 at 04:29
  • `walltime` is also specific to the HPC system; dask does not limit the execution time for any given task. For n_workers, it is set to 0 to avoid job submission at the time of cluster creation; if you put a positive value, then the number of jobs needed to achieve that many workers will be submitted. – SultanOrazbayev May 04 '22 at 10:56
  • Does a single worker execute `cores` many tasks at once? If I want to allow more than one thread per task, do I set job_cpu higher than the default? – HashBr0wn May 04 '22 at 12:13