2

Within my snakemake pipeline I'm trying to retrieve the correct wildcards. I've looked into wildcard_constraints and this post and this post, however I can't figure out the exact solution.

Here's an example of file names within 2 datasets. 1 dataset contains paired mouse RNAseq read files and another dataset contains human paired RNAseq read files.

"Mus_musculus" dataset is "PRJNA362883_GSE93946_SRP097621" with file names:

"SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz" "SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz"

"Homo_sapiens" dataset is "PRJNA362883_GSE93946_SRP097621" with file names:

"SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz" "SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz"

I would like glob_wildcards to spit out the following wildcards

> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621',
> 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872']
> ['SRR5195524_GSM2465521_KrasT_45649_NoDox',
> 'SRR5195524_GSM2465521_KrasT_45649_NoDox',
> 'SRR7942395_GSM3406786_sAML_Control_1',
> 'SRR7942395_GSM3406786_sAML_Control_1'] ['Mus_musculus', 'Mus_musculus',
> 'Homo_sapiens', 'Homo_sapiens'] ['1', '2', '1', '2']

I've tried the following code:

> import glob import os
> 
> DATASET,SAMPLE,SPECIES,FRR,
> =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz")
> print(DATASET,SAMPLE,SPECIES,FRR)

However, I get this as output. The many underscores messes up the glob_wildcards

> ['PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621',
> 'PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872']
> ['SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus',
> 'SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus',
> 'SRR7942395_GSM3406786_sAML_Control_1_Homo',
> 'SRR7942395_GSM3406786_sAML_Control_1_Homo'] ['musculus', 'musculus',
> 'sapiens', 'sapiens'] ['1', '2', '1', '2']

I for example tried this, but the output stays the same: wildcard_constraints: species = '!(sapiens)'

Could anyone suggest the correct code to get the wanted wildcards? Thanks in advance!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
manvenk
  • 69
  • 6

2 Answers2

1

If suitable for your case, it would be better to prepare a sample sheet or yaml file that indicates the characteristics of each file (e.g. the species, dataset, etc.). Then use python code (e.g. using pandas) to extract the wildcard values and direct the pipeline.

In my opinion, extracting information from filenames is brittle unless the names have been created within the pipeline and you have full control over them.

dariober
  • 8,240
  • 3
  • 30
  • 47
  • Thanks @dariober, you're a legend, I like the suggestion of a sample sheet! I'll check out this https://stackoverflow.com/questions/48909234/manually-create-snakemake-wildcards – manvenk Aug 06 '22 at 10:28
0

The naming scheme is a mess, so I'd fix that first and ensure that you do not ask underscores to perform two functions (separate words in a given variable and separate variables).

If that is not possible, wildcard_constraints is one solution:

wildcard_constraints:
    species="Mus_musculus|Homo_sapiens"
    # can try variations with r"" or escaping the pipe

Of course this would need to be adjusted if you have more species.

SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46
  • I agree, the naming scheme is a mess, but I'd rather not manually change file names after downloading them from SRA-explorer since I'll be having a few hundred files in the end. Unfortunately the output stays unchanged after including the wildcard_constraints. – manvenk Aug 06 '22 at 09:39
  • That's odd, could you try variations with adding `r` in front of the string and/or inserting a backslach in front of the pipe? (see the updated answer) – SultanOrazbayev Aug 06 '22 at 09:41
  • Thanks for the suggestion. I've tried variations of with and without the r and/or backslash without success. Output mysteriously stays the same. I think I'll look into making a sample sheet, because the naming get's too complex with hundreds of samples. – manvenk Aug 06 '22 at 10:34