
Main issue

I'm using Dask Jobqueue on a Slurm supercomputer. My workload is a mix of threaded (e.g. numpy) and pure-Python code, so I think a balance of threads and processes would suit my deployment best (which is the default behaviour). However, in order for my jobs to run at all I need to use this basic configuration:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(cores=20,
                       processes=1,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )
cluster.adapt(minimum=0, maximum=20)
client = Client(cluster)

which is entirely threaded. The tasks also seem to take longer than I would naively expect (a large part of the work is file reading/writing). Switching to purely processes, i.e.

cluster = SLURMCluster(cores=20,
                       processes=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )

results in Slurm jobs that are killed immediately after they are launched, with the only output being something like:

slurmstepd: error: *** JOB 11116133 ON nid00201 CANCELLED AT 2021-04-29T17:23:25 ***

Choosing a balanced configuration (i.e. default)

cluster = SLURMCluster(cores=20,
                       memory="60GB",
                       walltime='12:00:00',
                       ...
                       )

results in a strange intermediate behaviour. The work will run nearly to completion (e.g. 900/1000 tasks), then a number of the workers will be killed and progress will drop back down to, say, 400/1000 tasks.

Any thoughts on what's going on here?

Some extra context

On this particular machine, I've been advised by the sysadmins that multicore (non-MPI, e.g. Python multiprocessing) jobs should be launched using

srun -n 1 -c 20 python ...

Otherwise the processes will run on a single core. So in my cluster config I have

cluster = SLURMCluster(
    ...
    python='srun -n 1 -c 20 python',
    ...
)

My guess is that the choice between dask threads and processes shouldn't affect this, as we still want all 20 cores assigned to the job.
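
As a sanity check (a sketch of my own, assuming Linux worker nodes where os.sched_getaffinity is available), I can ask each worker how many cores it actually sees once the cluster is up:

# Sketch: confirm how many CPUs each worker process can actually use.
# `cluster` is the SLURMCluster defined above; assumes Linux workers.
import os
from dask.distributed import Client

client = Client(cluster)

# Client.run executes the function on every worker and returns a dict
# keyed by worker address; each value should be close to 20 if the
# srun wrapper hands the full allocation to the worker process.
print(client.run(lambda: len(os.sched_getaffinity(0))))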

Additionally, I need to load a module which makes a number of compiled tools available, but unfortunately also changes the PYTHONPATH. My (not at all ideal) workaround for this is:

cluster = SLURMCluster(
    ...
    env_extra=[
        'module load mymodule',
        'unset PYTHONPATH',
        'source /home/$(whoami)/.bashrc',
        'conda activate mycondaenv'
    ],
    ...
)

This seems to ensure that the python invoked above comes from the same mycondaenv environment that I use to launch the dask jobs.
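
To double-check that (again, just a sketch of my own), I can ask the workers which interpreter they are running:

# Sketch: confirm the workers use the interpreter from mycondaenv.
# `client` is the Client connected to the SLURMCluster above.
import sys

# Returns a dict of worker address -> path of the Python executable;
# each path should point into the mycondaenv environment.
print(client.run(lambda: sys.executable))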

EDIT: Updates

Following @SultanOrazbayev's suggestion, I looked at the Slurm job info for the failed job IDs. They all result in:

slurm_load_jobs error: Invalid job id specified

Further, I've found that using cluster.scale rather than cluster.adapt results in a successful run of the work. Perhaps the issue here is how adapt is trying to scale the number of jobs?
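
For now my workaround looks like this (a sketch; the jobs= keyword and the adaptive tuning arguments reflect my reading of the dask-jobqueue/distributed docs, so treat the exact values as unverified guesses):

# Option 1: fixed scaling, which completes the work successfully for me.
cluster.scale(jobs=20)   # request 20 Slurm jobs up front

# Option 2 (untested): keep adaptive scaling but make it slower to retire
# workers, by passing tuning knobs through to distributed's Adaptive.
# cluster.adapt(minimum=0, maximum=20, interval="30s", wait_count=10)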

Albatross
1 Answer


A possible reason is that the jobs are cancelled due to incompatible resource requests. The easiest way to resolve this is to inspect the output of `print(cluster.job_script())` and show it to your sysadmin.
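
For example (a minimal sketch reusing the parameters from the question):

# Print the Slurm submission script that dask-jobqueue generates, so the
# sysadmin can check the #SBATCH directives (CPUs, memory, walltime)
# against the site's limits.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=20, processes=20, memory="60GB",
                       walltime="12:00:00")
print(cluster.job_script())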

Alternatively, check the error logs on the cluster. Running `scontrol show jobid 1234` (where 1234 is the job id) should point you to the reasons for the cancellation and the relevant logs.

SultanOrazbayev
  • Thanks. I was already looking at the job script, and it looked completely fine. I had an earlier issue where I was accidentally requesting too many jobs, and that resulted in a visible error in the queue. The `scontrol` command shows that the IDs of the failed jobs aren't valid (I guess they aren't registered by the system?). See updates above – Albatross May 03 '21 at 03:12
  • Given the additional details, your problem might be caused by some of the options you omitted in the question, for example the location of the stdout/stderr files. – SultanOrazbayev May 03 '21 at 04:47
  • I'm not so sure. I just have `log_directory='logs'`, and the files are being written just fine. I think it has more to do with the behaviour of `adapt` vs `scale` (see the update above) – Albatross May 03 '21 at 06:22