I'm building an SQL script from text data. The relevant part of the script consists of a CREATE TABLE statement and an optional INSERT INTO statement. The values for the INSERT INTO statement are taken from a list of files, each of which may or may not exist; the values from all existing files are merged. The crucial part is that the INSERT INTO statement must be skipped when none of the data files exists.

I've written a Snakemake workflow that does this. Two ambiguous rules can create the script: one that creates a script for empty data, and one that creates the table and also inserts the data (the ambiguity is resolved with a ruleorder statement).

The interesting part is the rule that merges values from the data files. It should produce its output whenever at least one input is present, and it should not be considered at all otherwise. There are two difficulties: making each input optional, and preventing Snakemake from using this rule when no files exist. I've done that with a trick:

import os

def require_at_least_one(filelist):
    # Keep only the paths that exist as regular files.
    existing = [file for file in filelist if os.path.isfile(file)]
    return existing if existing else "non_existing_file"

rule merge_values:
    input: require_at_least_one(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"]))
    output: ...
    shell: ...

The require_at_least_one function takes a list of filenames and filters out those that don't refer to an existing file. This makes each input optional. For the corner case where none of the files exists, the function returns a special value that names a non-existing file. Since no rule can produce that file, Snakemake prunes this branch and prefers the rule that creates a script without the INSERT statement.
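To make the filtering behaviour concrete, here is a minimal plain-Python sketch of the helper, run outside Snakemake. The temporary directory and the variable names (`candidates`, `none_present`, `some_present`) are assumptions made for illustration only; the sentinel name is the one used in the rule above.

```python
import os
import tempfile

def require_at_least_one(filelist):
    """Return the existing files, or a sentinel path that is assumed not to exist."""
    existing = [path for path in filelist if os.path.isfile(path)]
    return existing if existing else "non_existing_file"

with tempfile.TemporaryDirectory() as root:
    # Same layout as in the rule: <root>/<dataset>/values
    candidates = [os.path.join(root, d, "values") for d in ["A", "B", "C"]]

    # No data file exists yet, so the sentinel is returned.
    none_present = require_at_least_one(candidates)

    # Create only the file for dataset B.
    os.makedirs(os.path.dirname(candidates[1]))
    with open(candidates[1], "w") as handle:
        handle.write("1\n")

    # Now only the existing path is returned.
    some_present = require_at_least_one(candidates)
    only_b = some_present == [candidates[1]]
```

Note that the helper runs when the Snakefile is parsed, so it only sees files that already exist on disk at that moment, not files that another rule would create later.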

I feel like I'm reinventing the wheel; moreover, the "non_existing_file" trick looks a little dirty. Is there a better, more idiomatic way to do this in Snakemake?

Dmitry Kuzminov
  • For what it's worth, that's also how we solved it in [seq2science](https://github.com/vanheeringen-lab/seq2science). We haven't found a more idiomatic way of doing this since. – Maarten-vd-Sande Dec 11 '20 at 13:39

1 Answer

My solution would be along these lines: don't try to force Snakemake to use or skip a rule from inside the rule itself; instead, specify which outputs you need, and Snakemake will decide whether it needs the rule. For your example, I would do something like:

import os

def required_files(filelist):
    # Keep only the paths that exist as regular files.
    return [file for file in filelist if os.path.isfile(file)]

rule what_to_gen:
    input:
        # Request the merged file only when at least one data file exists.
        merged = 'merged_files.txt' if required_files(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"])) else []

rule merge_values:
    input: required_files(expand("path_to_data/{dataset}/values", dataset=["A", "B", "C"]))
    output: 'merged_files.txt'
    shell: ...

This will execute the rule merge_values only if required_files is non-empty.
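The selection logic behind the target rule can be checked in plain Python. This is only a sketch of the stated intent (request the merged file only when required_files is non-empty); `targets_for` is a hypothetical helper introduced here for illustration, and the temporary directory stands in for path_to_data.

```python
import os
import tempfile

def required_files(filelist):
    """Keep only the paths that exist as regular files."""
    return [path for path in filelist if os.path.isfile(path)]

def targets_for(filelist, merged="merged_files.txt"):
    """Hypothetical helper: ask for the merged file only if some input exists."""
    return [merged] if required_files(filelist) else []

with tempfile.TemporaryDirectory() as root:
    candidates = [os.path.join(root, d, "values") for d in ["A", "B", "C"]]

    # Nothing on disk yet: the target rule would request nothing.
    targets_when_empty = targets_for(candidates)

    # Create the data file for dataset A only.
    os.makedirs(os.path.dirname(candidates[0]))
    with open(candidates[0], "w") as handle:
        handle.write("42\n")

    # One input exists: the merged file becomes a requested target.
    targets_with_data = targets_for(candidates)
```

In a Snakefile, the list returned by such a helper would be used as the input of the target rule, so merge_values is only pulled in when its output is actually requested.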

silence
  • Your solution is too straightforward: it works only in trivial cases like this one. What would you do if the rules were three levels deep? What if I have two rules with `ruleorder`, one of which depends on `merged_files.txt` while the other doesn't? – Dmitry Kuzminov Feb 13 '21 at 01:03