DISCLAIMER: You want to read your pairings from a YAML file; however,
I advise against this. I couldn't figure out how to do it elegantly with YAML formatting. I have an ad-hoc way of doing it to pair my SNP and INDEL annotations, but it takes a lot of boilerplate code just to write the pairings out from the YAML. That was acceptable in my case because the YAML variable is unlikely to ever be edited, so keeping a pedantically formatted string maintainable matters less.
I think the code you tried is just about right.
What I think is missing is the ability to "request" the correct pairings in your "rule all" input. I personally prefer to do this using Pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.
The pandas setup is very easy to maintain: it's a single tab- or space-separated file. That is easier for the end user than formatting nested YAML files (which I think a YAML-based setup would require). This is how I do it in my system, and it scales to any number of pairs. I'll admit accessing the pandas object is a bit tricky, but I've provided the code for you. Each item yielded by iterrows() is an (index, row) tuple, which explains the first layer of indexing (the [#] in the 'sample[1][tumor]' call): [1] is the row itself, and [0] is only the row's index label. I have yet to find a use for [0] and otherwise just ignore it.
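To make the indexing less mysterious, here is a minimal sketch of what happens outside of Snakemake. It assumes that expand() fills the pattern in with Python's str.format, and it inlines the contents of sampleFILEpair.txt and hard-codes "output" in place of config["pathDIR"] so it runs on its own. The names pairs_text, pairs and targets are mine, purely for illustration.

```python
import io

import pandas as pd

# Inline copy of sampleFILEpair.txt so the sketch is self-contained.
pairs_text = "tumor normal\nTimTum TimNorm\nXYZ ABC\n"
pairs = pd.read_csv(io.StringIO(pairs_text), sep=" ")

pattern = "{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam"

targets = []
for item in pairs.iterrows():
    # item is an (index, row) tuple: item[0] is the row label (0, 1, ...),
    # item[1] is a Series holding that row's "tumor" and "normal" values.
    targets.append(pattern.format(pathDIR="output", sample=item))

print(targets)
# ['output/TimTum_TimNorm.bam', 'output/XYZ_ABC.bam']
```

Inside a format field, [1] is treated as an integer index into the tuple and [tumor] as a string key into the row, which is why no quotes are needed around the column names.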
tree structure of workspace
(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk 620 Aug 4 10:57] Snakefile
|-- [tboyarsk 47 Aug 4 10:52] config.yaml
|-- [tboyarsk 512 Aug 4 10:57] output
|   |-- [tboyarsk 0 Aug 4 10:54] ABC.bam
|   |-- [tboyarsk 0 Aug 4 10:53] TimNorm.bam
|   |-- [tboyarsk 0 Aug 4 10:53] TimTum.bam
|   `-- [tboyarsk 0 Aug 4 10:57] XYZ.bam
`-- [tboyarsk 36 Aug 4 10:49] sampleFILEpair.txt
sampleFILEpair.txt (Proof the sample names can be unrelated)
tumor normal
TimTum TimNorm
XYZ ABC
config.yaml
pathDIR: output
sampleFILE: sampleFILEpair.txt
Snakefile
from subprocess import call
from pandas import read_table

configfile: "config.yaml"

rule all:
    input:
        # Request one merged BAM per tumor/normal row of the sample file.
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam",
               pathDIR=config["pathDIR"],
               sample=read_table(config["sampleFILE"], sep=" ").iterrows())

rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
    # wildcard_constraints:
    #     tumor = '[^_|-|\/][0-9a-zA-Z]*',
    #     normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        # Placeholder for the real GATK command: create the declared output file.
        call(['touch', output[0]])
In the past I have found the merging of wildcards to be a source of cyclical dependencies, so I always include wildcard_constraints when merging (which is essentially what we're doing here). They aren't actually necessary in this case. The "rule all" contains no wildcards and it is calling "gatk", so in this exact example there is no room for ambiguity. If this rule connects with other rules that use wildcards, though, it can generate some funky DAGs.
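Here is a rough sketch of the kind of ambiguity I mean, using plain regular expressions instead of Snakemake itself. It assumes Snakemake's default behaviour of matching each wildcard with ".+", and the file name is a hypothetical output from some downstream rule. The constrained pattern is a simplified letters-and-digits version, not the exact constraint commented out in the Snakefile.

```python
import re

# Unconstrained: each wildcard behaves like ".+", so it can swallow underscores.
loose = re.compile(r"(?P<tumor>.+)_(?P<normal>.+)\.bam")

# Constrained: each wildcard may only contain letters and digits.
strict = re.compile(r"(?P<tumor>[0-9a-zA-Z]+)_(?P<normal>[0-9a-zA-Z]+)\.bam")

# Hypothetical file produced by a later rule; it is not a tumor_normal pair.
name = "TimTum_TimNorm_realigned.bam"

loose_match = loose.fullmatch(name)
print(loose_match.groupdict())
# {'tumor': 'TimTum_TimNorm', 'normal': 'realigned'}

strict_match = strict.fullmatch(name)
print(strict_match)
# None

# The intended pair still matches under the constraint.
pair_match = strict.fullmatch("TimTum_TimNorm.bam")
print(pair_match.groupdict())
# {'tumor': 'TimTum', 'normal': 'TimNorm'}
```

Without the constraint, the merging rule would claim it can build the downstream file from a "tumor" called TimTum_TimNorm, which is how unexpected edges end up in the DAG.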