0

I have a similar goal as in Snakemake: unknown output/input files after splitting by chromosome , however, as pointed out, I do know in advance that I have e.g., 5 chromosomes in my sample.bam file. Using as a toy example:

$ cat sample.bam 
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5

I wish to "split" this bam file, and then do a bunch of per chromosome downstream jobs on the resulting chromosomes. The simplest solution I could conjure up was:

chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)


rule chromosome :
    output :
        touch('sample.REF_{chromosome}.bam')

    input : 'split.done'


rule split_bam :
    output :
        touch('split.done')

    input : 'sample.bam'

    run :
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]) :
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1

results in empty sample_REF_{chromosome}.bam files. I understand why this happens, and indeed snakemake even warns, e.g.,

Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.

that is, these files were not in the DAG to begin with, and snakemake touches them with empty versions, erasing what was put there. I guess I am surprised by this behavior, and wonder if there is a good reason for this. Note that this behavior is not limited to snakemake's touch(), since, should I replace touch('sample.REF_{chromosome}.bam') with simply 'sample.REF_{chromosome}.bam', and then have a shell :touch {output}`, I get the same result. Now, of course, I have found a perfectly acceptable workaround:

chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)


rule chromosome :
    output : 'sample.REF_{chromosome}.bam'

    input : 'split_dir'

    shell : 'mv {input}/{output} {output}'


rule split_bam :
    output :
        temp(directory('split_dir'))

    input : 'sample.bam'

    run :
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]) :
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1

but I am surprised I have to go though these gymnastics for a seemingly simple task. For this reason, I wonder if there is a better design, or if I am not asking the right question. Any advice/ideas are most welcome.

1 Answers1

0

I think your example is a bit contrived for a couple of reasons. The rule split_bam already produces the final output sample.REF_{chromosome}.bam. Also, the rule master uses the chromosomes taken from the variable chromosomes whereas the rule split_bam iterates through the bam file to get the chromosomes.

My impression is that what you want could be something like:

chromosomes= '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
            chromosome = chromosomes)

rule split_bam :
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome= chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end = '', file = open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')
dariober
  • 8,240
  • 3
  • 30
  • 47