2

I have a workflow written in Snakemake for analyzing biological sequencing data. The workflow expects all the data files to be organized so that each raw read file begins with the type of assay (RNASeq, DNaseSeq, etc.) and this filename convention is maintained throughout all the files the workflow produces.

I have a rule to align the reads for data from every assay except RNASeq, and a different rule that should only be applied to RNASeq data. I'm been having trouble getting these rules set up so that snakemake knows which to use for which files.

In the RNASeq rule, I have this:

wildcard_constraints: library='RNASeq_.+'

and this works to make sure the RNASeq libraries use that rule. I'm still getting an error about ambiguous rules for other assays, though, so I think I need to constrain the wildcards in the other rules. I've tried this:

wildcard_constraints: library='(!?RNASeq)_.+'

to say match anything that doesn't have RNASeq, but while this works if I try it in the python interpreter, snakemake seems to not be able to match anything to this regex. I've tried it other ways, such as '[^R][^N][^A]' but can't get anything to work.

Since these regexes work when I try them manually against strings, I think there's either a bug with how snakemake applies regular expressions, or I don't understand something about how they are used by snakemake. I was assuming it was simply "If this regex matches the wildcard string, use this rule. If it doesn't, don't use this rule."

Colin
  • 10,447
  • 11
  • 46
  • 54
  • Actually, you might try `'^(?!RNASeq_).+'` or `'^(?!RNASeq_)'` – Wiktor Stribiżew Oct 20 '17 at 20:17
  • I have tried both of those too. This is definitely something specific to snakemake, either a bug or a misunderstanding about how it's using the regular expressions. – Colin Oct 20 '17 at 20:27
  • I believe `^` and `$` can't be used because snakemake will wrap your regex pattern with its own (e.g. see [here](https://bitbucket.org/snakemake/snakemake/src/56957cece56a51e4b5e2feb234d1721b87543211/snakemake/io.py?at=master&fileviewer=file-view-default#io.py-514)). You say "this works if I try it in the python interpreter", but does it? Both of these match: `re.search('(?!RNA)_.+', 'RNA_data')` and `re.search('(?!RNA)_.+', 'DNA_data')`... (PS: I'm trying to solve the same problem as you) – obk Feb 16 '18 at 06:32

2 Answers2

4

I believe the following demonstrates what you're trying to achieve:

# Snakefile

rule sam_startswith_dna:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='dna.+'
    shell: 'touch {output}'

rule sam_not_startswith_dna:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='(?!dna).+'  # negative lookahead assertion
    shell: 'touch {output}'

rule bam_endswith_rna:
    output: '{pattern}.bam'
    wildcard_constraints: pattern='.+rna'
    shell: 'touch {output}'

rule bam_not_endswith_rna:
    output: '{pattern}.bam'
    wildcard_constraints: pattern='.+(?<!rna)'  # negative lookbehind assertion
    shell: 'touch {output}'

Using it (snakemake 4.6.0, python 3.6):

$ snakemake -n dna_sample.sam   # runs rule: sam_startswith_sam

$ snakemake -n sample.sam       # runs rule: sam_not_startswith_sam
$ snakemake -n sample_dna.sam   # runs rule: sam_not_startswith_sam

$ snakeamke -n sample_rna.bam   # runs rule: bam_endswith_rna

$ snakemake -n sample.bam       # runs rule: bam_not_endswith_rna
$ snakemake -n rna_sample.bam   # runs rule: bam_not_endswith_rna

Here's what I think you were doing:

# Snakefile2

rule sam_startswith_dna_:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='dna_.+'
    shell: 'touch {output}'

rule sam_not_startswith_dna_:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='(?!dna)_.+'
    shell: 'touch {output}'

Using it:

$ snakemake -s Snakefile2 dna_data.sam  # runs rule: sam_startswith_dna_

$ snakemake -s Snakefile2 rna_data.sam  # raises MissingRuleException :( :( :(

Here's how you could have fixed it:

# Snakefile3

rule sam_startswith_dna_:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='dna_.+'
    shell: 'touch {output}'

rule sam_not_startswith_dna_:
    output: '{pattern}.sam'
    wildcard_constraints: pattern='(?!dna)[^_]{3}_.+'
    shell: 'touch {output}'

Using it:

$ snakemake -s Snakefile3 -n dna_data.sam  # runs rule: sam_startswith_dna_

$ snakemake -s Snakefile3 -n rna_data.sam  # runs rule: sam_not_startswith_dna_

But it's not very general because of the hardcoded {3}:

$ snakemake -s Snakefile3 -n gdna_data.sam  # raises MissingRuleException

The following is based on my brief reading of snakemake.io.regex and some poking around; may contain errors

In general, given a rule like this:

rule some_rule:
    output: 'some.{pattern}.txt'
    wildcard_constraints: pattern='[a-z_]+'
    shell: 'touch {output}'

and a command line invocation like this:

$ snakemake some.tar_get.txt

the rule some_rule will be executed if

re.search('some\.(?P<pattern>[a-z_]+)\.txt$', 'some.tar_get.txt')

returns a match (assuming other checks pass (eg ambiguity, cyclic dag, etc)).

Interestingly, $ is appended to the pattern, but ^ isn't prepended.

This behavior was different from what I initially thought, which went something like this (this would allow the use of ^ and $ in your wildcard_constraints):

# python3, pseudo-code-ish

output = 'some.{pattern}.txt'
pattern = '[a-z_]+'

target = 'some.tar_get.txt'

# First test: does the target file name match the output (without the constraint)?
m = re.search('some\.(?P<pattern>.+)\.txt', target)
if not m:
    raise MissingInputException

# Second test: does the wildcard satisfy user-supplied constraint?
m = re.search(pattern, m.group('pattern'))
if not m:
    raise MissingInputException

run_rule()
obk
  • 488
  • 5
  • 12
1

If you don't want your lines to start with RNASeq or DNaseSeq, you can do

r'^(?!RNASeq)(?!DNaseSeq).+'
azizj
  • 3,428
  • 2
  • 25
  • 33
  • I only want to check that it doesn't start with RNASeq. I have multiple regular expressions that correctly check this if I use python's re package myself directly in the interpreter. I think this must be something specific to snakemake. – Colin Oct 20 '17 at 20:29