Main issue
I'm using Dask Jobqueue on a Slurm supercomputer. My workload includes a mix of threaded (e.g. numpy) and pure-Python tasks, so I think a balance of threads and processes would be best for my deployment (which is the default behaviour). However, in order for my jobs to run I need to use this basic configuration:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)
which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of the workload is file reading/writing). Switching to purely processes, i.e.
cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
results in Slurm jobs that are killed immediately after they are launched, with the only output being something like:
slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***
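The only other thing I can think of for getting more information out of these jobs is to point the Slurm stdout/stderr somewhere explicit via job_extra (entries are added as #SBATCH directives to the generated script); a sketch of that, with purely illustrative file names, would be:

cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       # extra #SBATCH lines; %j expands to the Slurm job id,
                       # so each worker job gets its own log files
                       job_extra=['--output=dask_worker_%j.out',
                                  '--error=dask_worker_%j.err'],
                       )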
Choosing a balanced configuration (i.e. the default)
cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
results in a strange intermediate behaviour. The work runs nearly to completion (e.g. 900/1000 tasks), then a number of the workers are killed and the progress drops back down to, say, 400/1000 tasks.
Any thoughts on what's going on here?
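For reference, my (possibly wrong) understanding is that memory is the total per job and gets divided across the worker processes in that job, so an explicit intermediate split would look something like the following, where the 4-process figure is purely illustrative:

from dask_jobqueue import SLURMCluster

# Illustrative only: 4 worker processes x 5 threads each. If the 60GB job
# memory is split evenly across processes, each of these 4 workers would
# get a memory limit of roughly 15GB.
cluster = SLURMCluster(cores=20,
                       processes=4,
                       memory="60GB",
                       walltime='12:00:00',
                       )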
Some extra context
On this particular machine, I've been advised by the sysadmins that multicore (non-MPI, e.g. Python multiprocessing) jobs should be launched using
srun -n 1 -c 20 python ...
Otherwise the processes will run on a single core. So in my cluster config I have
cluster = SLURMCluster(
    ...
    python='srun -n 1 -c 20 python',
    ...
)
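As a sanity check that this wrapper actually ends up in the submission script, the script dask-jobqueue generates can be printed out:

# Show the sbatch script that will be submitted for each worker job; the
# worker launch line should be prefixed with the configured
# `srun -n 1 -c 20 python`.
print(cluster.job_script())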
My guess is that the choice between Dask threads and processes shouldn't affect this, since either way we still want all 20 cores assigned to the job.
Additionally, I need to load a module which makes a number of compiled tools available, but which unfortunately also changes the PYTHONPATH. My (not at all ideal) workaround for this is:
cluster = SLURMCluster(
    ...
    env_extra=[
        'module load mymodule',
        'unset PYTHONPATH',
        'source /home/$(whoami)/.bashrc',
        'conda activate mycondaenv'
    ],
    ...
)
This seems to successfully ensure that the python called above is from the same mycondaenv environment that I use to launch the Dask jobs.
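A quick way to double-check this is to ask each worker which interpreter it is actually running:

import sys

# client.run executes the function on every worker and returns a dict of
# {worker address: result}; each path should point into mycondaenv.
print(client.run(lambda: sys.executable))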
EDIT: Updates
Following @SultanOrazbayev's suggestion, I looked at the Slurm job info for the failed job IDs. They all result in:
slurm_load_jobs error: Invalid job id specified
Further, I've found that using cluster.scale rather than cluster.adapt results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?
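For reference, the working pattern replaces the cluster.adapt(minimum=0, maximum=20) call with a fixed request along these lines (the count here is only illustrative):

# Ask for a fixed number of Slurm jobs up front rather than letting
# adapt add and remove them.
cluster.scale(jobs=20)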