I am running Snakemake (v7.6.2) and noticed that, contrary to its stated principles, it attempts to re-run steps of a pipeline whose output files already exist.

In my first run I had the following DAG:

[image: DAG of the first run]

which finished successfully, but I now want to add another rule to it (quast_first), as shown in the following 'updated' DAG:

[image: updated DAG including quast_first]

(I did that by adding the output of quast_first as an input for quast_second.)

In a dry run, I would expect only the following rules to be re-executed:

  1. quast_first: output does not exist, it was not part of the previous workflow
  2. quast_second: its output exists, but it now has a new dependency (quast_first). In this specific case the output should be exactly the same, because the output of quast_first is only declared as a dependency and is not actually used as input data by quast_second
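
The expectation in the list above can be sketched as a small model: a rule runs if its output is missing or if any rule it depends on has to run. This is only an illustration of that reasoning, not Snakemake's scheduler. The function name `jobs_to_run` is hypothetical, and because the DAG images are not available, the edge from `flye` to the two quast rules is an assumption standing in for the already-finished upstream steps.

```python
def jobs_to_run(deps, existing_outputs):
    """Return the rules that need to run.

    deps maps each rule to the rules it depends on.
    existing_outputs is the set of rules whose output is already on disk.
    """
    needed = set()

    def visit(rule):
        # A rule must run if its own output is missing ...
        run = rule not in existing_outputs
        # ... or if any rule it depends on has to run.
        for parent in deps.get(rule, []):
            if visit(parent):
                run = True
        if run:
            needed.add(rule)
        return run

    for rule in deps:
        visit(rule)
    return needed


# Assumed, simplified dependencies (the real DAG has more rules).
deps = {
    "flye": [],
    "quast_first": ["flye"],                  # assumption: input of quast_first
    "quast_second": ["flye", "quast_first"],  # quast_first added as a new input
}
existing = {"flye", "quast_second"}           # outputs left over from the first run

print(sorted(jobs_to_run(deps, existing)))    # ['quast_first', 'quast_second']
```

Under this model only the two quast rules are scheduled, which is the behaviour expected here.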

However, I see that Snakemake wants to re-generate the whole workflow. Below is an extract from a dry run with the --reason flag, as explained in this question:

rule symLinkFQ:
    input: logs/BORD1725, /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
    output: symLinkFq/BORD1725
    log: /home/ngs/tempSnakemake/20220420Microbiology_q20/logs/BORD1725
    jobid: 34
    reason: Updated input files: /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04
    wildcards: barcode=BORD1725
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp

ln -s /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04 symLinkFq/BORD1725

However, I can confirm that the output of the rule symLinkFQ does exist (workdir is /home/ngs/tempSnakemake/20220420Microbiology_q20):

[ngs@vngs20x ~/tempSnakemake/20220420Microbiology_q20]$ ll symLinkFq/BORD1725
lrwxrwxrwx. 1 ngs ngs 125 24. Mai 14:01 symLinkFq/BORD1725 -> /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04

so I don't quite understand what is meant by this line from the output above:

reason: Updated input files: /nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04

Also, at the end of the dry run, the job stats again show that the whole workflow would be executed if I ran it:

Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
cat_fastq             4              1              1
flye                  4              1              1
minimap_first         4              1              1
minimap_second        4              1              1
quast_first           4              1              1
quast_second          4              1              1
racon_first           4              1              1
racon_second          4              1              1
symLinkFQ             4              1              1
total                37              1              1

I have been using earlier versions of Snakemake (mainly v5.*), and as far as I recall this is the first time I have encountered this issue (Snakemake re-running rules whose output files already exist). Could this be version-related? For example, do I now have to pass a command-line argument telling Snakemake not to re-generate output files that already exist (although I would always expect that to be the default behaviour)?

BCArg
  • To get the obvious out of the way: were there in fact any changes to `/nexus/Gridion/20220420Microbiology_q20/no_sample/20220405_1846_X1_FAT23098_47b43b4a/High_accuracy_basecalling/pass/barcode04`? That looks like a directory, if I know my Nanopore outputs. (Is your sequencing/basecalling completely done?) If you're sure the directory will be there (or can handle it being absent, e.g. if a barcode drops out completely) and you don't care about updated files within it, you might try defining its path in `params` instead of `input`. – KeyboardCat May 24 '22 at 13:32
  • No file was updated in any of those directories, and basecalling finished long ago (sometime in April), so nothing is being written to those folders. – BCArg May 24 '22 at 15:59
  • This could be a version issue. I might be wrong, but at some point (maybe six months ago) there was a bug in identifying rules to re-run. – SultanOrazbayev May 25 '22 at 04:35

1 Answer

It turns out that this is indeed version-related. I removed Snakemake v7.6.2 and installed v5.8.1, and Snakemake no longer wants to repeat rules whose output files exist. The same bug is present in v7.8.0, which is currently the latest release.

BCArg
  • Thanks for opening [an issue](https://github.com/snakemake/snakemake/issues/1677) about it. – Wayne May 25 '22 at 15:34
  • Note that in reply to the filing of an issue, mbhall88 has posted [a workaround](https://github.com/snakemake/snakemake/issues/1677#issuecomment-1138019777) along with a justification. – Wayne May 27 '22 at 14:48
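
On the `reason: Updated input files` line quoted in the question: as far as I understand, that reason is reported when an input looks newer than the output by modification time. The sketch below is a simplified model of such a check, not Snakemake's actual code, and can be used to inspect the timestamps by hand. The helper names (`own_mtime`, `target_mtime`, `updated_inputs`) are hypothetical, and comparing against the symlink's own timestamp rather than its target's is an assumption.

```python
import os


def own_mtime(path):
    """Modification time of the path itself; a symlink is not followed."""
    return os.stat(path, follow_symlinks=False).st_mtime


def target_mtime(path):
    """Modification time of whatever the path resolves to."""
    return os.stat(path).st_mtime


def updated_inputs(inputs, output):
    """Return the inputs that are newer than the output.

    Simplified mtime model (an assumption, not Snakemake's implementation):
    the output's own timestamp is used, so for a symlink output this is the
    time the link was created, not the time of the directory it points to.
    """
    out_time = own_mtime(output)
    return [p for p in inputs if target_mtime(p) > out_time]
```

Run from the workdir, `updated_inputs(["logs/BORD1725", "<path to barcode04>"], "symLinkFq/BORD1725")` would list whichever inputs carry a newer timestamp than the link. Note that a directory's modification time changes whenever an entry is created, removed, or renamed directly inside it, even if no file contents change.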