My workflow has the following basic structure:
- files are downloaded from a remote server,
- converted locally and then
- analyzed.
One of the analyses is time-consuming, but it scales well when run on multiple input files at once. The output of this rule is independent of which files are analyzed together as a batch, as long as they all share the same settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow this rule is an outlier. Which files are to be run together is known in advance, although ideally, if some of the inputs fail to be produced along the way, the rule should run on the reduced set of files.
The following example illustrates the problem:
samples = ['a', 'b', 'c', 'd', 'e', 'f']

# Fixed assignment of samples to analysis batches
groups = {
    'A': samples[0:3],
    'B': samples[3:6],
}

rule all:
    input:
        expand("done/{sample}.txt", sample=samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        "created/{sample}.txt"
    output:
        "analyzed/{sample}.txt"
    params:
        outdir = "analyzed/"
    shell:
        """
        sleep 1  # or longer
        # {{}} and {{/}} are GNU parallel's placeholders for the input path
        # and its basename; the escaped \> redirects per job, not once globally
        parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
        """

rule finalize:
    input:
        "analyzed/{sample}.txt"
    output:
        "done/{sample}.txt"
    shell:
        "touch {output}"
The rule analyze is the one that should produce multiple output files from multiple inputs, according to the assignment in groups. The rules create and finalize operate on individual files upstream and downstream, respectively.
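To make the intent concrete, a batched version of the rule would look roughly like the sketch below. Everything in it is hypothetical: the rule name analyze_group, the {group} wildcard, and the analyzed_batches/ directory output are my own placeholders, not a working drop-in — in particular, finalize would still need to locate its per-sample file inside the batch directory.

rule analyze_group:
    input:
        # all members of the batch, resolved from the group wildcard
        lambda wc: expand("created/{sample}.txt", sample=groups[wc.group])
    output:
        # one directory per batch (hypothetical layout)
        directory("analyzed_batches/{group}")
    shell:
        """
        mkdir -p {output}
        parallel md5sum {{}} \> {output}/{{/}} ::: {input}
        """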
Is there a way to implement such logic? I'd like to try to avoid splitting the workflow to accommodate this irregularity.
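For reference, whatever form the batched rule takes, the per-sample finalize rule will presumably need the reverse lookup from a sample to its group. A minimal sketch (the helper name group_of is my own, assuming the groups are disjoint):

def group_of(sample):
    # Return the group a sample belongs to; assumes each sample appears
    # in exactly one group (raises StopIteration otherwise).
    return next(g for g, members in groups.items() if sample in members)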
Note: this question is not related to the similar-sounding question here.