What is the best way to use expand() with one unknown variable in Snakemake?

Question

I am currently using Snakemake for a bioinformatics project. Given a human reference genome (hg19) and a bam file, I want to be able to specify that there will be multiple output files with the same name but different extensions. Here is my code

rule gridss_preprocess:
        input:
                ref=config['ref'],
                bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
                bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
        output:
                expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample = "{sample}")

Currently config['workreq'] is a list of extensions that start with "."

For example, I want to be able to use expand to indicate the following files

S1.dedup.downsampled.bam.cigar_metrics
S1.dedup.downsampled.bam.computesamtags.changes.tsv
S1.dedup.downsampled.bam.coverage.blacklist.bed
S1.dedup.downsampled.bam.idsv_metrics

I want to be able to do this for multiple sample files, S_. Currently I am not getting an error when I try to do a dry run. However, I am not sure if this will run properly.

Am I doing this right?

Eric C has a solution, but to add, the reason you are getting an error is that the string `"{sample}"` in expand is being interpreted as a list. If you have `sample=["{sample}"]` your code should also work. — Troy Comi, Jun 28 '22 at 13:00
`expand` seems to be a very frequent source of confusion for snakemake beginners, and it is easy to avoid using it by just having basic python programming knowledge. The snakemake documentation might be improved by first presenting a plain python version (list comprehension, for instance) before introducing expand as a convenient shortcut. — bli, Jun 30 '22 at 11:19

score 3 · Answer 1 · answered Jun 28 '22 at 11:58

expand() defines a list of files. If you're using two parameters, the cartesian product will be used. Thus, your rule will define as output ALL files with your extension list for ALL samples. Since you define a wildcard in your input, I think that what you want is all files with your extension for ONE sample. And this rule will be executed as many times as the number of samples.

You're mixing up wildcards and placeholders for the expand() function. You can define a wildcard inside an expand() by doubling the brackets:

rule all:
    input:  expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext = config['workreq'], sample=SAMPLELIST)

rule gridss_preprocess:
    input:
            ref=config['ref'],
            bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
            bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
            expand(config['bamdir'] + "{{sample}}.dedup.downsampled.bam{ext}", ext = config['workreq'])

This expand function will expand in list

{sample}.dedup.downsampled.bam.cigar_metrics
{sample}.dedup.downsampled.bam.computesamtags.changes.tsv
{sample}.dedup.downsampled.bam.coverage.blacklist.bed
{sample}.dedup.downsampled.bam.idsv_metrics

and thus define the wildcard sample to match the files in the input.

You can also use `expand("{sample}.dedup...bam{ext}", ext=config['workreq'], allow_missing=True)` instead of double curly braces. It's handy for filenames in your config that you can't easily modify or with multiple wildcards you don't want to expand. — Troy Comi, Jun 28 '22 at 12:59

What is the best way to use expand() with one unknown variable in Snakemake?

1 Answers1