
I have input files in this format:

dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam

I would like to run a command on each of them and merge the results into a CSV file. The command returns an integer for a given filename:

$ samtools view -c -F1 dataset1/file1.bam
200

I would like to run the command on each file and merge the outputs into the following CSV file:

file1,200,100
file2,400,300

I can do this without using expand in the input, appending to the output file with the >> operator, but I would like to use > instead, to avoid the possible file corruption that appending can lead to.
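(For reference, that append-based version is roughly the following sketch; the rule name and the sentinel .done output are illustrative, since the shared CSV cannot be the declared output of each per-file job.)

rule append_rc_results:
    input:
        in1="{param1}/{param2}.bam",
        in2="{param1}/{param2}_rmd.bam"
    output: touch("{param1}/{param2}.done")
    shell:
        """
        RCT=$(samtools view -c -F1 {input.in1})
        RCD=$(samtools view -c -F1 {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" >> {wildcards.param1}_merged.csv
        """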

I tried something like this, which did not work because of the wildcards.param2 part:

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        RCT=$(samtools view -c -F1  {input.in1})
        RCD=$(samtools view -c -F1  {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """

I am aware that the input is no longer a single file but a list of files created by expand. Therefore, I defined a function to work on a list of inputs, but it is still not quite right:

def get_read_count(infiles):
    return [os.popen("samtools view -c -F1 " + infile).read() for infile in infiles]

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
    shell:
        """
        printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
        """

What is the best practice for using wildcards inside the input file IDs when the input file list is defined with expand?

Edit: I can get the expected result with expand if I use an external bash script, such as this script.sh:

for INF in "${@}";do
        IN1=${INF}
        IN2=${IN1%.bam}_rmd.bam
        LIB=$(basename ${IN1%.*}|cut -d_ -f1)
        RCT=$(samtools view -c -F1 ${IN1} )
        RCD=$(samtools view -c -F1 ${IN2} )
        printf "${LIB},${RCT},${RCD}\n"
done

with

        params: script="script.sh"
        shell:
                """
                bash {params.script} {input} > {output} 
                """

but I am interested in learning if there is an easier way to get the same output only using snakemake.

Edit2:

I can also do it in the shell directive instead of using a separate script:

rule collect_rc_results:
    input:
        in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
        in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        for INF in {input.in1};do
            IN1=${{INF}}
            IN2=${{IN1%.bam}}_rmd.bam
            LIB=$(basename ${{IN1%.*}}|cut -d_ -f1)
            RCT=$(samtools view -c -F1 ${{IN1}} )
            RCD=$(samtools view -c -F1 ${{IN2}} )
            printf "${{LIB}},${{RCT}},${{RCD}}\n"
        done > {output}
        """

and thus get the expected files. However, I would be interested to hear if anyone has a more elegant or "best practice" solution.

Isin Altinkaya
    Using `expand` in the input of a rule is for when your rule will use all the resulting files at once. Typically, this is used in rules summarizing the output of previous rules. In snakemake, you likely should proceed in two steps: One rule does the computation on a single input, and stores the result in one file, another uses `expand` to gather all those results and put everything in one summary file. The intermediate files (output of the first rule) could be `temp` (See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#protected-and-temporary-files). – bli May 02 '22 at 10:25
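For concreteness, a minimal sketch of the two-step pattern bli describes might look like this (the rule names and the intermediate .counts extension are illustrative):

rule count_reads:
    input:
        in1="{param1}/{param2}.bam",
        in2="{param1}/{param2}_rmd.bam"
    output: temp("{param1}/{param2}.counts")
    shell:
        """
        RCT=$(samtools view -c -F1 {input.in1})
        RCD=$(samtools view -c -F1 {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """

rule collect_rc_results:
    input: expand("{param1}/{param2}.counts", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    shell: "cat {input} > {output}"

Each per-file job writes with >, so there is no append contention, and the temp() outputs are cleaned up once the merged CSV has been built.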

1 Answer


I don't think there is anything really wrong with your current solution, but I would be more inclined to use a run directive with calls to Snakemake's shell function to perform your loop.

@bli's suggestion to use temp files is also good, especially if the intermediate step (samtools in this case) is long-running; you can get tremendous wall-clock gains from parallelizing those computations. The downside is that you will be creating lots of tiny files.

I noticed that your inputs are fully qualified through expand, but based on your example I think you want to leave param1 as a wildcard. Assuming PARS2 is a list, it should be safe to zip in1, in2, and PARS2 together. Here's my take (written but not tested):

rule collect_rc_results:
    input: 
        in1=expand("{param1}/{param2}.bam", param2=PARS2, allow_missing=True),
        in2=expand("{param1}/{param2}_rmd.bam", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    run:
        with open(output[0], 'w') as outfile:
            for infile1, infile2, parameter in zip(input.in1, input.in2, PARS2):
                # read=True makes shell() capture the command's stdout;
                # int() tolerates the trailing newline in the captured count
                RCT = int(shell(f'samtools view -c -F1 {infile1}', read=True))
                RCD = int(shell(f'samtools view -c -F1 {infile2}', read=True))
                outfile.write(f'{parameter},{RCT},{RCD}\n')
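If you would rather not depend on shell()'s capture behavior, a sketch of the same loop using Python's subprocess module directly (under the same assumptions about input.in1, input.in2, and PARS2):

import subprocess

# inside the run directive, replacing the shell() calls:
with open(output[0], 'w') as outfile:
    for infile1, infile2, parameter in zip(input.in1, input.in2, PARS2):
        # capture stdout as text and strip the trailing newline
        RCT = subprocess.run(['samtools', 'view', '-c', '-F1', infile1],
                             capture_output=True, text=True, check=True).stdout.strip()
        RCD = subprocess.run(['samtools', 'view', '-c', '-F1', infile2],
                             capture_output=True, text=True, check=True).stdout.strip()
        outfile.write(f'{parameter},{RCT},{RCD}\n')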

Troy Comi