I am running Snakemake (v7.6.2) and I noticed that, contrary to its stated principles, it attempts to re-run steps of a pipeline whose output files already exist.
In my first run I had the following DAG:
which finished successfully, but I now want to add another rule to it (quast_first), as shown in the following 'updated' DAG:
(I did that by adding the output of quast_first as an input for quast_second.)
If I call a dry run, I would expect the following rules to be re-executed:
- quast_first: its output does not exist, as it was not part of the previous workflow
- quast_second: its output exists, but it has a new dependency (quast_first). In this specific case the output should be exactly the same, because the output of quast_first is just a dependency (not a real input) for quast_second
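To make that expectation concrete, here is a minimal sketch of the rule I have in mind: a job is re-run only if its output is missing or one of its inputs has a newer modification time than the output. This is my own simplified model, not Snakemake's actual implementation, and the function name needs_rerun is hypothetical.

```python
import os


def needs_rerun(output, inputs):
    """Simplified model (an assumption, not Snakemake's real logic):
    re-run if the output is missing or any input is newer than it."""
    if not os.path.lexists(output):
        # Output was never produced, e.g. a rule that is new to the workflow.
        return True
    out_mtime = os.path.getmtime(output)
    # Any input modified after the output makes the output stale.
    return any(os.path.getmtime(path) > out_mtime for path in inputs)
```

Under this model, quast_first would be scheduled because its output is missing, and quast_second because its new input is newer than its existing output; nothing upstream would be touched.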
However, I see that Snakemake wants to re-generate the whole workflow. Below is an extract from a dry run with the --reason flag, as explained in this question:
rule symLinkFQ:
    input: logs/BORD1725, /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
    output: symLinkFq/BORD1725
    log: /home/ngs/tempSnakemake/20220420Microbiology_q20/logs/BORD1725
    jobid: 34
    reason: Updated input files: /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
    wildcards: barcode=BORD1725
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp
ln -s /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04 symLinkFq/BORD1725
However, I can confirm that the output of the rule symLinkFQ does exist (workdir is /home/ngs/tempSnakemake/20220420Microbiology_q20):
[ngs@vngs20x ~/tempSnakemake/20220420Microbiology_q20]$ ll symLinkFq/BORD1725
lrwxrwxrwx. 1 ngs ngs 125 24. Mai 14:01 symLinkFq/BORD1725 -> /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
so I don't quite understand what is meant by this line from the extract above:
reason: Updated input files: /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
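If "Updated input files" refers to a modification-time comparison (that is my assumption; I have not confirmed it in the Snakemake source), then a check along these lines would show which timestamps are involved. It reports the mtime of a path itself (for a symlink, the link) and of whatever the path resolves to. The helper name describe_mtimes is hypothetical.

```python
import os


def describe_mtimes(path):
    """Return (mtime of the path itself, mtime of what it resolves to).

    For a symlink the two values can differ: os.lstat looks at the link,
    os.stat follows it to the target file or directory.
    """
    own = os.lstat(path).st_mtime
    target = os.stat(path).st_mtime
    return own, target


if __name__ == "__main__":
    # Paths taken from the dry-run extract; run from the workdir.
    output = "symLinkFq/BORD1725"
    if os.path.lexists(output):
        own, target = describe_mtimes(output)
        print(f"link mtime:   {own}")
        print(f"target mtime: {target}")
```

If the target directory turned out to be newer than the symlink, that would at least be consistent with the reported reason, although it would not explain why the behaviour differs from what I remember in v5.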
Also, at the end of the dry run, it shows again that the whole workflow would be executed if I ran it:
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
cat_fastq             4              1              1
flye                  4              1              1
minimap_first         4              1              1
minimap_second        4              1              1
quast_first           4              1              1
quast_second          4              1              1
racon_first           4              1              1
racon_second          4              1              1
symLinkFQ             4              1              1
total                37              1              1
I have mainly used earlier versions of Snakemake (v5.*), and as far as I recall this is the first time I have encountered this issue (Snakemake re-running rules whose output files already exist). Could this be version-related? For example, do I now have to pass a command-line argument telling Snakemake not to re-generate output files that already exist (although I would always expect that to be the default behaviour)?