Self-answer with my current solution.
My difficulty using DVC with Slurm jobs is that DVC runs stage commands serially (unless you get into queuing experiments, which introduces Celery, i.e. yet another queue on top of Slurm ... yikes). If the stage commands put themselves in the background, however, DVC will chug merrily along, but then you have to enforce the DAG ordering yourself. I did this with advisory file system locking. You also don't want to run dvc commit until the backgrounded commands have completed.
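To see the locking idea in isolation, here is a toy snippet of my own (not part of the pipeline): both commands are launched immediately, but because they share one lock file they never run at the same time.
#!/bin/sh
# Toy demo: both commands start right away, but whichever grabs demo.lock
# second blocks until the first releases it, so the two serialize.
flock demo.lock sh -c 'echo "started at $(date +%T)"; sleep 3' &
flock demo.lock sh -c 'echo "started at $(date +%T)"; sleep 3' &
wait
rm demo.lock
The two timestamps should come out roughly three seconds apart even though both jobs were backgrounded at once.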
Here's a pipeline with three stages (minimal working examples of <CMD> given below); note that the DAG allows stages one and three to run in parallel, while two must run after one.
stages:
  one:
    cmd: flock lock/a <ONE> &
    outs:
      - one.txt
  two:
    cmd: flock lock/a <TWO> &
    deps:
      - one.txt
    outs:
      - two.txt
  three:
    cmd: flock lock/b <THREE> &
    outs:
      - three.txt
The lock/a and lock/b files are created by the flock command and correspond to the two separate branches of the DAG. Using flock may not be the ultimate solution; the release order of multiple stage commands waiting on the same lock is unclear to me.
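If you want to poke at that release-order question yourself, a quick experiment of my own (again, not part of the pipeline) is to queue several waiters on one lock and watch which order they wake up in:
#!/bin/sh
# Hold the lock for a few seconds, then see which waiter is released first.
# The wakeup order is up to the OS; I wouldn't assume it's FIFO.
flock order.lock sleep 5 &
sleep 1    # give the holder time to grab the lock before the waiters queue up
for i in 1 2 3
do
    flock order.lock echo "waiter $i released at $(date +%T)" &
done
wait
rm order.lock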
Wrap your dvc repro command in a script something like this:
#!/bin/sh
set -e
mkdir lock
dvc repro --no-commit
# Wait for each branch of the DAG to finish by acquiring its lock,
# then remove the lock file.
for item in lock/*
do
    flock "$item" rm "$item"
done
rmdir lock
This script would be your sbatch submission script, but I'm leaving all that out. I'll also leave out the srun parts of the minimal working example below, but you'd need them for Slurm in your stage commands.
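For reference, a Slurm-ified stage command would look something like this; the srun flags are placeholders for whatever resources your job actually needs, not something lifted from a working setup:
# hypothetical stage cmd: run the stage's work as a Slurm job step
flock lock/a srun --ntasks=1 ./stamp.sh </dev/null >one.txt &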
When you source job.sh (or sbatch job.sh), the commands all fire into the background and DVC exits. The flock mechanism takes over for releasing commands to run, and the script exits after all locks are released (and cleaned up). You would then run dvc commit.
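So the end-to-end flow is roughly the following; using sbatch --wait to block until the job finishes is just one option (a plain source works when you're not on Slurm):
sbatch --wait job.sh    # or simply: source job.sh
dvc commit              # cache the outputs once all backgrounded commands have finished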
Here's an example that works without Slurm:
stages:
  one:
    cmd: flock lock/a ./stamp.sh </dev/null >one.txt &
    outs:
      - one.txt
  two:
    cmd: flock lock/a ./stamp.sh <one.txt >two.txt &
    deps:
      - one.txt
    outs:
      - two.txt
  three:
    cmd: flock lock/b ./stamp.sh </dev/null >three.txt &
    outs:
      - three.txt
With executable stamp.sh:
#!/bin/sh
# Print a timestamp, echo back any line on stdin with "now is" -> "then was",
# then sleep so the overlap (or lack of it) between stages is visible.
echo "time now is $(date +'%T')"
read line
echo "$line" | sed -e "s/now is/then was/"
sleep 10
Some results:
% source job.sh
Running stage 'three':
> flock lock/b ./stamp.sh </dev/null >three.txt &
WARNING: 'three.txt' is empty.
Running stage 'one':
> flock lock/a ./stamp.sh </dev/null >one.txt &
WARNING: 'one.txt' is empty.
Running stage 'two':
> flock lock/a ./stamp.sh <one.txt >two.txt &
WARNING: 'two.txt' is empty.
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
% grep "time" *.txt
one.txt:time now is 11:38:58
three.txt:time now is 11:38:58
two.txt:time now is 11:39:08
two.txt:time then was 11:38:58