4

I want a rule to perform realignment between normal and tumor. The main problem is I don't know how to manage that problem. Is it the wildcard or the expand the answer to my problem?

This is my list of samples:

conditions:
   pair1:
        tumor: "432"
        normal: "433"

So the rule need to be something like this

rule gatk_RealignerTargetCreator:
    input:
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",sample=config['conditions']['pair1']['tumor']),
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",sample=config['conditions']['pair1']['normal']),

    output:
        "mapped_reads/merged_samples/{pair1}.realign.intervals"

How can I do this operation for all keys on conditions? (I suppose to have more that one pair)

I have tried this code:

    input:
        lambda wildcards: config["conditions"][wildcards.condition],
        tumor= expand("mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",tumor=config['conditions'][wildcards.condition]['tumor']),
        normal = expand("mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",normal=config['conditions'][wildcards.condition]['normal']),

    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"

name 'wildcards' is not defined

??

m00am
  • 5,910
  • 11
  • 53
  • 69
mau_who
  • 315
  • 2
  • 13
  • `wildcard` is not defined in the input elements. It can work as a function parameter, as in your `lambda`, but not directly in your `tumor` and `normal` elements of `input`. (I see a closing parenthesis with no matching opening parenthesis on the second line of the input of `gatk_RealignerTargetCreator`.) – bli Aug 08 '17 at 07:29
  • @bli what do you suggest for resolve this? – mau_who Aug 08 '17 at 14:40
  • Remove the `lamba willdcards: ...` thing from the input. Define a function outside the rule that takes `wildcards` as its only argument, uses config to determine the list of possible values for `tumor` given the value of `wildcards.condition` and does the `expand` using this list. Use that function as `tumor` in the input. Do similarly for `normal`. – bli Aug 09 '17 at 08:30
  • I added an attempt at answering your question with some code. – bli Aug 09 '17 at 09:36

2 Answers2

6

wildcards is not "directly" defined in the input of a rule. You need to use a function of wildcards instead. I'm not sure I understand exactly what you want to do, but you may try something like that.

def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,    
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...
bli
  • 7,549
  • 7
  • 48
  • 94
0

DISCLAIMER: You want to read your pairings from a YAML file, however, I advise against this. I couldn't figure out how to do it elegantly using YAML formatting. I have an ad-hoc way of doing it to pair my SNP and INDEL annotations, however, there is a lot of boiler plate code JUST so it can write it from the YAML. This was okay because the YAML variable is likely never edited, so maintenance in a pedantically formatted string is no longer important in this case.

I think the code you tried is just about right. What I think is missing is the ability to "request" the correct pairings in your "rule all" input. I personally prefer to do this using Pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.

The pandas setup is very easy to maintain, it's a single file tab or space separated. Easier for the end user than formatting nest YAML files (What I think would be required if setup via YAML format). This is how I do it in my system. It scales indefinitely. I'll admit accessing the pandas object is a bit tricky, but I've provided the code for you. Just know that first layer of objects (The [#] in the 'sample[1][tumor]' call), the [0] I think is just meta data on the file being read. I have yet to find a use for it and otherwise just ignore it.

tree structure of workspace

(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk       620 Aug  4 10:57]  Snakefile
|-- [tboyarsk        47 Aug  4 10:52]  config.yaml
|-- [tboyarsk       512 Aug  4 10:57]  output
|   |-- [tboyarsk         0 Aug  4 10:54]  ABC.bam
|   |-- [tboyarsk         0 Aug  4 10:53]  TimNorm.bam
|   |-- [tboyarsk         0 Aug  4 10:53]  TimTum.bam
|   `-- [tboyarsk         0 Aug  4 10:57]  XYZ.bam
`-- [tboyarsk        36 Aug  4 10:49]  sampleFILEpair.txt

sampleFILEpair.txt (Proof the sample names can be unrelated)

tumor normal
TimTum TimNorm
XYZ ABC

config.yaml

pathDIR: output
sampleFILE: sampleFILEpair.txt

Snakefile

 from pandas import read_table

 configfile: "config.yaml"

 rule all:
     input:
         expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows())


 rule gatk_RealignerTargetCreator:
     input:
         "{pathGRTC}/{normal}.bam",
         "{pathGRTC}/{tumor}.bam",
     output:
         "{pathGRTC}/{tumor}_{normal}.bam"
 #    wildcard_constraints:
 #        tumor = '[^_|-|\/][0-9a-zA-Z]*',
 #        normal = '[^_|-|\/][0-9a-zA-Z]*'
     run:
         call('touch ' + str(wildcard.tumor) + '_' + str(wildcard.normal) + '.bam', shell=True)

With the merging of wildcards, in the past, I have found it to be a source of cyclical dependencies, so I also always include wildcard_constraints when merging (essentially that's what we're doing). They aren't actually necessary here. The "rule all" contains no wildcards, and it is calling "gatk", so in this exact example where is no room for ambiguity, but if this rule connects with other rules utilizing wildcards, usually it can generate some funky DAG's.

TBoyarski
  • 441
  • 3
  • 9
  • @TBoyarsky, I try to se what you suggest. I have this error snakemake --configfile config.tardis2.yaml --directory WD -np InputFunctionException in line 17 of WD/rules/samfiles.rules: KeyError: '432/432_433' Wildcards: sample=432/432_433 [scripts] (https://pastebin.com/7CwtmkUn) – mau_who Aug 09 '17 at 21:50
  • I think this has to do with greedy wildcards. "Wildcards: sample=432/432_433" Sample should not contain any slashes "/". Your pastebin isn't using wildcard constraints. They are important to ensure {pathDIR} becomes "432" and {sample} only becomes "432_433" – TBoyarski Aug 10 '17 at 16:47
  • I use two wildcards and in other rules I use only ne wildcard. How can I resolve..thanks for any help – mau_who Sep 20 '17 at 22:04
  • It doesn't matter how many wildcards you use. What matters is that the wildcards contained within the desired output, cannot contain slashes. Snakemake uses output file name as a way to name the job. If the wildcard is apart of the output file name, it will be included. [But since "/" are not allowed in UNIX file names](https://unix.stackexchange.com/questions/155793/is-it-correct-to-use-certain-special-characters-when-naming-filenames-in-linux) you get an error. You cannot make a file "432/432_432output.sh". And that's what you are trying to do because your wildcard is "432/432_433". – TBoyarski Sep 21 '17 at 00:29
  • @tboyarskiI have a question.If Could you help me to understand this command. If I use sample[1]['tumor'] on ipython don't work...They expect tan header? sample=read_table(config["conditions"], ",").iterrows()) – mau_who Sep 25 '17 at 19:20