I have a similar goal as in Snakemake: unknown output/input files after splitting by chromosome , however, as pointed out, I do know in advance that I have e.g., 5 chromosomes in my sample.bam
file. Using as a toy example:
$ cat sample.bam
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5
I wish to "split" this bam file, and then do a bunch of per chromosome downstream jobs on the resulting chromosomes. The simplest solution I could conjure up was:
chromosomes = '1 2 3 4 5'.split()
rule master :
input :
expand('sample.REF_{chromosome}.bam',
chromosome = chromosomes)
rule chromosome :
output :
touch('sample.REF_{chromosome}.bam')
input : 'split.done'
rule split_bam :
output :
touch('split.done')
input : 'sample.bam'
run :
print('splitting bam..')
chromosome = 1
for line in open(input[0]) :
outfile = 'sample.REF_{}.bam'.format(chromosome)
print(line, end = '', file = open(outfile, 'w'))
chromosome += 1
results in empty sample_REF_{chromosome}.bam
files. I understand why this happens, and indeed snakemake even warns, e.g.,
Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.
that is, these files were not in the DAG to begin with, and snakemake touches them with empty versions, erasing what was put there. I guess I am surprised by this behavior, and wonder if there is a good reason for this. Note that this behavior is not limited to snakemake's touch()
, since, should I replace touch('sample.REF_{chromosome}.bam')
with simply 'sample.REF_{chromosome}.bam'
, and then have a shell :
touch {output}`, I get the same result. Now, of course, I have found a perfectly acceptable workaround:
chromosomes = '1 2 3 4 5'.split()
rule master :
input :
expand('sample.REF_{chromosome}.bam',
chromosome = chromosomes)
rule chromosome :
output : 'sample.REF_{chromosome}.bam'
input : 'split_dir'
shell : 'mv {input}/{output} {output}'
rule split_bam :
output :
temp(directory('split_dir'))
input : 'sample.bam'
run :
print('splitting bam..')
shell('mkdir {output}')
chromosome = 1
for line in open(input[0]) :
outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
print(line, end = '', file = open(outfile, 'w'))
chromosome += 1
but I am surprised I have to go though these gymnastics for a seemingly simple task. For this reason, I wonder if there is a better design, or if I am not asking the right question. Any advice/ideas are most welcome.