How to efficiently combine N files, two at a time

Question

Based on the discussion in another question.

Some tools will only accept two input files at a time, but the final, merged output requires merging N output files. Examples include paste and some bed or vcf tools. Assume a list of samples is present and the binary operation is associative, (a+b)+c == a+(b+c). The required, merged output must be generated by repeatedly combining input and intermediate files. How can you efficiently merge the files?

score 1 · Answer 1 · answered Jul 04 '22 at 21:01

The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:

ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids)  # 676 samples, need not be numbers
# aa, ab, ac, .., zz

rule all:
    input: 'merged.txt'

rule generate_data:
    output: 'sample_{sample}.txt'
    shell:
        'echo {wildcards.sample} > {output}'

Sequential Solution

The sequential solution is fairly easy to remember and understand. You combine files 1 and 2 into a temporary file, then combine the temporary file with file 3, ... until file N. You can do this with a run directive and shell commands but I will present this as just a shell directive

rule merge:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged.txt'
    shell:
        'paste {input.first_files} > {output} \n'
        'for file in {input.rest_files} ; do '
            'paste {output} $file > {output}_tmp \n'
            'mv {output}_tmp {output} \n'
        'done '

Recursive Solution

The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step them combine those intermediate files until one merged file is left. The difficulty is that snakemake evaluates the dag from the top down and the number of files may not be evenly divisible by 2.

rule merge:
    """Request final output from merged files 0 to N-1."""
    input:
        f'temp_0_{len(samples)-1}'
    output: 'merged.txt'
    shell:
        'cp {input} {output}'

def merge_intermediate_input(wildcards):
    """From start and end indices, request input files.  Raises ValueError when indices are equal."""
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:  # perform link instead
        raise ValueError
    if start + 1 == end:  # base case
        return expand('sample_{sample}.txt',
                sample=(samples[start], samples[end]))
    # default
    return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']

rule merge_intermediate:
    """Solve subproblem, producing start to end."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    shell:
        'paste {input} > {output}'

def merge_base_input(wildcards):
    """Get input sample from index in list."""
    index = int(wildcards.start)
    return f'sample_{samples[index]}.txt'

rule merge_base:
    """Create temporary symbolic link for input file with start==end."""
    input: merge_base_input
    output: temp('temp_{start}_{start}')
    shell:
        'ln -sr {input} {output}'

merge_intermediate solves the subproblem of producing the merged files from start to end from the two merged files split halfway. When start == end, the merged file is created as a symbolic link. When start + 1 == end, the base case is to merge the input files at those indices. The recursive solution is clearly more code and more complex, but it can be more efficient in long-running or complex merge operations.

Runtime Complexity, Performance

Let each of the N files have k lines and the runtime complexity of the merge operation have O(f(n)). In the sequential solution, the temporary file is created N-1 times and its length increases as 2k, 3k ... for a total of k*N*(N+1)/2 - k ~ O(f(k N^2)).

For the recursive solution, in the first layer, each pair of files is joined. Each operation requires O(f(2k)) and there are N/2 such operations. Next, each pair of the resulting files are merged, at a cost of O(f(4k)) with N/4 operations. Overall, ln(N) layers of merges are required to produce the final output, again with N-1 merge operations. The complexity of the entire operation is O(f(k N ln(n))).

In terms of overhead, the recursive solution launches N-1 snakemake jobs with any associated calls to the scheduler, activating environments, etc. The sequential version launches a single job and runs everything in a single shell process.

The recursive solution can run with more parallelism; each 'level' of the recursive solution is independent, allowing up to N/2 jobs to run at once. The sequential solution requires the results of each previous step. There is an additional challenge with resource estimation for the recursive solution. The first merges have O(2k) while the last has O(k N). The resources could be dynamically estimated or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), the resources could be similar.

Conclusion

While the recursive solution offers better asymptotic runtime complexity, it introduces more snakemake jobs, temporary files, and complex logic. The sequential solution is straight-forward and contained in a single job, though could be N/ln(N) times slower. Quick merge operations can be successfully performed with the sequential solution and the runtime won't be much worse until N is quite large. However, if merging takes 10s of minutes or longer, depends on the input file sizes, and produces outputs longer than inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and a significantly shorter wall clock time.

How to efficiently combine N files, two at a time

1 Answers1

Sequential Solution

Recursive Solution

Runtime Complexity, Performance

Conclusion

Linked