I have input files in this format:
dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam
I would like to run a command with each and merge the results into a csv file. The command returns an integer given filename.
$ samtools view -c -F1 dataset1/file1.bam
200
I would like to run the command and merge the output for each file into the following csv file:
file1,200,100
file2,400,300
I can do this without using the expand in input and using an append operator >>
, but in order to avoid possible file corruptions it can lead to I would like to use >
.
I tried something like this which did not work due to wildcards.param2
part:
rule collect_rc_results:
input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
shell:
"""
RCT=$(samtools view -c -F1 {input.in1})
RCD=$(samtools view -c -F1 {input.in2})
printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
"""
I am aware that the input is no longer a single file but a list of files created by expand
. Therefore I defined a function to work on a list input, but it is still not quite right:
def get_read_count:
return [ os.popen("samtools view -c -F1 "+infile).read() for infile in infiles ]
rule collect_rc_results:
input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
shell:
"""
printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
"""
What is the best practice to use the wildcards inside the input file IDs when the input file list is defined with expand?
Edit:
I can get the expected result with expand if I use an external bash script, such as script.sh
for INF in "${@}";do
IN1=${INF}
IN2=${IN1%.bam}_rmd.bam
LIB=$(basename ${IN1%.*}|cut -d_ -f1)
RCT=$(samtools view -c -F1 ${IN1} )
RCD=$(samtools view -c -F1 ${IN2} )
printf "${LIB},${RCT},${RCD}\n"
done
with
params: script="script.sh"
shell:
"""
bash {params.script} {input} > {output}
"""
but I am interested in learning if there is an easier way to get the same output only using snakemake.
Edit2:
I can also do it in the shell
instead of a separate script,
rule collect_rc_results:
input:
in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
output: "{param1}_merged.csv"
shell:
"""
for INF in {input};do
IN1=${{INF}}
IN2=${{IN1%.bam}}_rmd.bam
LIB=$(basename ${{IN1%.*}}|cut -d_ -f1)
RCT=$(samtools view -c -F1 ${{IN1}} )
RCD=$(samtools view -c -F1 ${{IN2}} )
printf ${{LIB}},${{RCT}},${{RCD}}\n"
done > {output}
"""
thus get the expected files. However, I would be interested to hear about it if anyone has a more elegant or a "best practice" solution.