How to generate multiple output files from single input file that does not share naming convention in snakemake?

Question

I have searched for this for some time now, and this thread is the closest I got, but I could not get it working with my setup.

What I want to do:

I have one text file where every line has an ID and a data point

1234 data2
5678 data3
...

I want to collect the lines that correspond to certain IDs, which I have in my config file, and write each set of lines to its own file named after the ID value (1234 or 5678)

# config.yaml
IDs:
    ID1: 1234
    ID2: 5678

When I did this without snakemake, I just looped over the list of IDs in my bash script and grepped the text file for them, but I just cannot accomplish this with snakemake.
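
For reference, the loop-and-grep approach described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `split_by_id` and `write_id_files` are hypothetical, lines are assumed to be whitespace-separated, and the ID is matched against the first field only (grep would match it anywhere in the line).

```python
from pathlib import Path


def split_by_id(lines, ids):
    """Group lines by their first whitespace-separated field.

    Only IDs listed in `ids` are kept. Hypothetical helper, not part of Snakemake.
    """
    wanted = {str(i) for i in ids}
    selected = {i: [] for i in wanted}
    for line in lines:
        fields = line.split()
        if fields and fields[0] in wanted:
            selected[fields[0]].append(line)
    return selected


def write_id_files(selected, outdir):
    """Write one file per ID, named after the ID value."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for id_value, id_lines in selected.items():
        (outdir / id_value).write_text("".join(f"{line}\n" for line in id_lines))


# Values mirror the config.yaml from the question; the third line is made up
# to show that IDs missing from the config are skipped.
config = {"IDs": {"ID1": 1234, "ID2": 5678}}
text_lines = ["1234 data2", "5678 data3", "9999 data4"]
result = split_by_id(text_lines, config["IDs"].values())
```

Calling `write_id_files(result, "some_output_dir")` would then create one file per configured ID; the directory name is a placeholder.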

Either I have an issue with wildcards in the target, or my expand function gives all of the IDs to the grep command in the shell, or, when following the accepted answer in the linked thread, I get "missing input files for rule all: And_Laa A_log". I can share what I have now, but I think the correct way to do this is so far removed from what I have that it will just confuse everyone:

configfile: "config.yaml"

# Trying to replicate stackoverflow answer
speakers = {
  "1": "And_Laa",
  "2": "A_log"
}

def get_speaker(wildcards):
#  return expand("{speaker}", speaker=config["speakers"]) 
  return speakers[wildcards.speaker]

rule all:
  input:
#    expand("{speaker}_wav-list", speaker=config[speakers])
    expand("{speaker}", speaker=speakers.values())

# Selecting all the audiofiles for the speakers from a very large file
rule select_speaker_files:
  input:
    wav=config["files"]["wavs"]
  output:
    speaker="{speaker}_wav-list"
  params:
    speaker=get_speaker,
  shell:
    'grep "{params.speaker}" {input.wav} > {output.speaker}'

score 2 · Accepted Answer · answered Jul 21 '21 at 00:39

First of all, I guess that what you call a "speaker" is not a value of the dict but its key. So your rule all should expand like this:

rule all:
  input:
    expand("{speaker}", speaker=speakers)
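
For context, iterating over a Python dict yields its keys, which is why expanding over `speakers` and over `speakers.values()` requests different files. The sketch below uses a hypothetical stand-in, `expand_one`, that mimics only the single-wildcard case of Snakemake's `expand`; it is not Snakemake's implementation.

```python
speakers = {
    "1": "And_Laa",
    "2": "A_log",
}


def expand_one(pattern, **wildcards):
    """Simplified stand-in for snakemake's expand(), single wildcard only."""
    (name, values), = wildcards.items()
    return [pattern.format(**{name: value}) for value in values]


# Iterating the dict itself gives the keys "1" and "2".
over_keys = expand_one("{speaker}", speaker=speakers)
# Iterating .values() gives the speaker names.
over_values = expand_one("{speaker}", speaker=speakers.values())
# Adding the suffix produces names that end in _wav-list.
with_suffix = expand_one("{speaker}_wav-list", speaker=speakers.values())

print(over_keys)
print(over_values)
print(with_suffix)
```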

Next, this rule all literally says: "I require two files with the filenames 1 and 2." But there is no rule that produces files with these names. You have:

rule select_speaker_files:
  output:
    speaker="{speaker}_wav-list"

This rule claims: "I can produce a file whose name ends in _wav-list." So no rule can produce the files that rule all asks the pipeline to create. You probably meant this:

rule all:
  input:
    expand("{speaker}_wav-list", speaker=speakers)

In this case the rules are at least consistent with each other.
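
To see why the original rule all failed with "missing input files", it can help to model how a requested filename is matched against an output pattern. The sketch below is a rough approximation under assumptions: `pattern_to_regex` and `match_output` are hypothetical helpers, each wildcard is treated as "one or more of any character", and real Snakemake matching has more features (wildcard constraints, for example).

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern like '{speaker}_wav-list' into an anchored regex."""
    # Splitting on a capturing group keeps the wildcard names in the result:
    # even indices are literal text, odd indices are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>.+)"
    return re.compile(f"^{regex}$")


def match_output(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    m = pattern_to_regex(pattern).match(target)
    return m.groupdict() if m else None


output_pattern = "{speaker}_wav-list"
# Requested by the original rule all: no rule output fits this name.
print(match_output(output_pattern, "And_Laa"))
# Requested by the corrected rule all: the wildcard is filled in.
print(match_output(output_pattern, "And_Laa_wav-list"))
```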

answered Jul 21 '21 at 00:39

Dmitry Kuzminov


Thanks! I did actually want the value, and I wanted it in both the grep command and the output file name. So I modified the Snakefile to this and it seems to work, though I think there must be another way to give the wildcard to the shell command: `def get_speaker(wildcards): return wildcards` `rule all: input: expand("{speaker}_wav-list", speaker=speakers.values())` `params: get_speaker,` `shell: 'grep "{params}" {input.wav} > {output.speaker}'` – juho Jul 21 '21 at 07:26
Why do you need the function `get_speaker` in this case if it just literally returns the argument? – Dmitry Kuzminov Jul 21 '21 at 07:45
You're right, I don't. I just did not know how to access the wildcard in the shell but apparently `grep "{wildcards.speaker}"` does that. I tried it before with just {wildcards} and some other ways but that did not work. – juho Jul 21 '21 at 09:33
The best (and recommended) way to access a wildcard from `shell` section is to use its name: `grep "{speaker}"`. – Dmitry Kuzminov Jul 21 '21 at 17:23
So I have now extended my script, and it seems to be working nicely. But if I try with just `{speaker}` instead of `{wildcards.speaker}` I get the `NameError: The name 'speaker' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}` – juho Jul 27 '21 at 11:52
To answer your question I need to know the whole script that produces this error. – Dmitry Kuzminov Jul 27 '21 at 13:45
I'm pretty sure I'll run into a different error with my script quite soon so I think I'll create another question then and link it here (especially since there is a character limit in comments). You have been way too helpful already :) – juho Jul 27 '21 at 18:14
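
The `NameError` reported in the comments fits the way the working command was written: inside a `shell` command the wildcard value is reached through the `wildcards` object, as in `{wildcards.speaker}`, while bare `{speaker}` placeholders belong in `input` and `output` patterns. As a loose analogy only, plain Python `str.format` behaves similarly; the file name `wavs.txt` is a placeholder, and Snakemake's own formatting differs in detail (it raised `NameError`, whereas `str.format` raises `KeyError`).

```python
from types import SimpleNamespace

# Stand-in for the wildcards object Snakemake passes to a job.
wildcards = SimpleNamespace(speaker="And_Laa")

# Attribute access inside the placeholder resolves against the namespace.
cmd = 'grep "{wildcards.speaker}" {input} > {output}'.format(
    wildcards=wildcards, input="wavs.txt", output="And_Laa_wav-list"
)

# A bare {speaker} has nothing to resolve against, so formatting fails.
try:
    'grep "{speaker}"'.format(wildcards=wildcards)
    bare_ok = True
except KeyError:
    bare_ok = False

# Doubled braces survive formatting as literal braces, as the error text notes.
awk_part = "{{print $1}}".format()

print(cmd)
print(bare_ok)
print(awk_part)
```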
