to speed up a my workflow I would like to do something similar to snakemake-unknown-output-input-files-after-splitting-by-chromosome
SAMPLE = "sample_a"
rule all:
expand("results/{sample}-final-combined.bam", sample=SAMPLE)
rule exclude_blacklist:
input:
region_file = "data/{sample}.bed",
blacklist = "data/blacklists/universal_blacklist.bed"
output:
"results/{sample}_blacklist-excluded.bed"
shell:
"""
bedtools intersect -v -a {input.region_file} -b {input.blacklist} > {output}
"""
rule target_regions_by_chrom:
input:
region_file ="results/{sample}_blacklist-excluded.bed",
output:
temp("results/{sample}_{chrom}.bed")
script:
"scripts/split_region_by_chrom.py"
rule simulate_reads:
input:
"results/{sample}_{chrom}.bed"
output:
"results/{sample}_{chrom}.bam"
script:
"scripts/simulate_reads.py"
rule combine_regions_again:
input:
files = expand("results/{sample}_{chrom}.bam", sample=SAMPLE, chrom=get_chroms())
output:
"results/{sample}-final-combined.bam"
script:
"scripts/combine_bams.py"
The workflow should take a .bed file with unfiltered regions, exclude problematic regions from a blacklist, split the filtered .bed file by chromosome and then simulate reads for these regions (for simplicity I spare you the details), generating .bam files. Ultimately these .bam files should be combined again to an output file.
The main problem is how to solve the unknown number of chromosomes beforehand. If I take the input .bed file, I get some errors, because some files will be empty. I thought about using an input function (here get_chroms()) to recover the chromosome from the output .bed file ("results/{sample}_blacklist-excluded.bed") after excluding the blacklisted regions, but this leads to errors during the creation of the DAG because the file is not existent.
Does anybody have any suggestions how to solve these problems:
- split a .bed file by chromosome, while not knowing the present chromosomes
- defining a wildcard based on an intermediate file in the workflow
Any help would be greatly appreciated!