The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:
ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids) # 676 samples, need not be numbers
# aa, ab, ac, .., zz
rule all:
input: 'merged.txt'
rule generate_data:
output: 'sample_{sample}.txt'
shell:
'echo {wildcards.sample} > {output}'
Sequential Solution
The sequential solution is fairly easy to remember and understand. You combine files 1 and 2 into a temporary file, then combine the temporary file with file 3, ... until file N. You can do this with a run
directive and shell
commands but I will present this as just a shell
directive
rule merge:
input:
first_files=expand('sample_{sample}.txt', sample=samples[:2]),
rest_files=expand('sample_{sample}.txt', sample=samples[2:])
output: 'merged.txt'
shell:
'paste {input.first_files} > {output} \n'
'for file in {input.rest_files} ; do '
'paste {output} $file > {output}_tmp \n'
'mv {output}_tmp {output} \n'
'done '
Recursive Solution
The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step them combine those intermediate files until one merged file is left. The difficulty is that snakemake evaluates the dag from the top down and the number of files may not be evenly divisible by 2.
rule merge:
"""Request final output from merged files 0 to N-1."""
input:
f'temp_0_{len(samples)-1}'
output: 'merged.txt'
shell:
'cp {input} {output}'
def merge_intermediate_input(wildcards):
"""From start and end indices, request input files. Raises ValueError when indices are equal."""
start, end = int(wildcards.start), int(wildcards.end)
if start == end: # perform link instead
raise ValueError
if start + 1 == end: # base case
return expand('sample_{sample}.txt',
sample=(samples[start], samples[end]))
# default
return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']
rule merge_intermediate:
"""Solve subproblem, producing start to end."""
input: merge_intermediate_input
output: temp('temp_{start}_{end}')
shell:
'paste {input} > {output}'
def merge_base_input(wildcards):
"""Get input sample from index in list."""
index = int(wildcards.start)
return f'sample_{samples[index]}.txt'
rule merge_base:
"""Create temporary symbolic link for input file with start==end."""
input: merge_base_input
output: temp('temp_{start}_{start}')
shell:
'ln -sr {input} {output}'
merge_intermediate
solves the subproblem of producing the merged files from start
to end
from the two merged files split halfway. When start == end
, the merged file is created as a symbolic link. When start + 1 == end
, the base case is to merge the input files at those indices. The recursive solution is clearly more code and more complex, but it can be more efficient in long-running or complex merge operations.
Runtime Complexity, Performance
Let each of the N files have k lines and the runtime complexity of the merge operation have O(f(n))
. In the sequential solution, the temporary file is created N-1 times and its length increases as 2k, 3k ... for a total of k*N*(N+1)/2 - k ~ O(f(k N^2))
.
For the recursive solution, in the first layer, each pair of files is joined. Each operation requires O(f(2k))
and there are N/2
such operations. Next, each pair of the resulting files are merged, at a cost of O(f(4k))
with N/4
operations. Overall, ln(N)
layers of merges are required to produce the final output, again with N-1 merge operations. The complexity of the entire operation is O(f(k N ln(n)))
.
In terms of overhead, the recursive solution launches N-1 snakemake jobs with any associated calls to the scheduler, activating environments, etc. The sequential version launches a single job and runs everything in a single shell process.
The recursive solution can run with more parallelism; each 'level' of the recursive solution is independent, allowing up to N/2 jobs to run at once. The sequential solution requires the results of each previous step. There is an additional challenge with resource estimation for the recursive solution. The first merges have O(2k)
while the last has O(k N)
. The resources could be dynamically estimated or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), the resources could be similar.
Conclusion
While the recursive solution offers better asymptotic runtime complexity, it introduces more snakemake jobs, temporary files, and complex logic. The sequential solution is straight-forward and contained in a single job, though could be N/ln(N)
times slower. Quick merge operations can be successfully performed with the sequential solution and the runtime won't be much worse until N is quite large. However, if merging takes 10s of minutes or longer, depends on the input file sizes, and produces outputs longer than inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and a significantly shorter wall clock time.