
I am getting started with Snakemake and I have a very basic question whose answer I couldn't find in the Snakemake tutorial.

I want to create a single-rule Snakefile that downloads multiple files on Linux, one by one. It seems `expand` cannot be used in the output because the files need to be downloaded one by one, and wildcards cannot be used because it is the target rule.

The only way that comes to my mind is something like the following, which doesn't work properly. I cannot figure out how to send the downloaded items to a specific directory with specific names such as 'downloaded_files.dwn', using {output} so they can be used in later steps:

from subprocess import call

links = [link1, link2, link3, ....]

rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString = ''
        for item in links:
            callString += 'wget str(item) -O ' + {output} + '\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)

I would appreciate any hint on how this should be solved, and on which part of Snakemake I didn't understand well.

Masih
  • If you don't run snakemake with the `-j` option, only one rule instance will be run at a given time. Do the files need to be downloaded in a precise order? – bli Jun 16 '17 at 12:22
  • Also, it is common to use a first `all` rule having only input, for which you can use expand. This will drive the rest of the workflow. – bli Jun 16 '17 at 12:25
  • Is there a pattern in the name of the links that can be used to decide the names of the downloaded files? Remember that Snakemake is meant to work with regularity in the file names. – bli Jun 16 '17 at 12:33
  • The rule you show has a problem in that `output` corresponds to a single file name, but your `callString` will consist in several calls to `wget` with always the same `-O` argument. Also, the `{downloaded_file}` part will make Snakemake have a wildcard named "downloaded_file", and it will be unable to determine its value without further information. You should probably try to simplify your rule for a start. What would you do if you had only one link? – bli Jun 16 '17 at 12:42
  • Yet another observation: your `params.shellCallFile` probably should be a `log` and not a `params`. See for instance https://stackoverflow.com/a/42839257/1878788 – bli Jun 16 '17 at 12:45

1 Answer


Here is a commented example that could help you solve your problem:

# Create some way of associating output files with links
# The output file names will be built from the keys: "chain_{key}.gz"
# One could probably directly use output file names as keys 
links = {
    "1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
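One caveat with this approach: `sorted` on string keys is lexicographic, so with ten or more links "10" would sort before "2". The ordering and the generated target names can be checked in plain Python, outside Snakemake (the short keys and link values here are placeholders, just to illustrate the loop above):

```python
# Same key -> link mapping shape as in the rule above (placeholder values).
links = {"2": "url2", "1": "url1", "3": "url3"}

# The "run" section visits the keys in sorted order and builds
# one target file name (and one wget call) per key.
targets = ["outdir/chain_{n}.gz".format(n=n) for n in sorted(links.keys())]
print(targets)
```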

And here is another way of doing it, one that uses arbitrary names for the downloaded files and uses `output` (although a bit artificially):

links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")]


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # Using a direct loop over the pairs (filename, link)
        # could be considered "cleaner":
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))
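The list comprehension used for `output` can also be checked in plain Python; the same list drives both the declared outputs and the indexed access in the `run` loop (link values abbreviated here):

```python
# (filename, link) pairs, as in the rule above (links abbreviated).
links = [
    ("foo_chain.gz", "link1"),
    ("bar_chain.gz", "link2"),
    ("baz_chain.gz", "link3")]

# This is the expression used in the "output" section:
outputs = ["outdir/{f}".format(f=filename) for (filename, _) in links]
print(outputs)
# Index i pairs links[i][1] with output[i], as in the "run" loop.
```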

And here is an example where the three downloads can be done in parallel, using `snakemake -j 3`:

# To use os.path.join,
# which is more robust than manually writing the separator.
import os

# Association between output files and source links
links = {
    "foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}


# Make this association accessible via a function of wildcards
def chainfile2link(wildcards):
    return links[wildcards.chainfile]


# First rule will drive the rest of the workflow
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())


rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """
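Snakemake calls `chainfile2link` with a wildcards object that exposes the matched parts of the output path as attributes. Outside Snakemake, its behavior can be sketched with `types.SimpleNamespace` standing in for the wildcards object (the namespace is my stand-in for illustration, not a Snakemake API):

```python
from types import SimpleNamespace

# Same association as above (only one entry kept for brevity).
links = {
    "foo_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"}

def chainfile2link(wildcards):
    # Snakemake calls this once per download job;
    # wildcards.chainfile holds the matched part of the output path.
    return links[wildcards.chainfile]

# SimpleNamespace mimics the attribute access on the wildcards object.
wc = SimpleNamespace(chainfile="foo_chain.gz")
print(chainfile2link(wc))
```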
bli
  • Thanks bli for your great solution. Just one more question. Could this rule also be modified to download the links in parallel? – Masih Jun 18 '17 at 12:57
  • To run in parallel, you could probably move the `expand` in the `input` of an `all` rule, remove the `for` loop from the `run` part, and use `-j`. The `all` rule will cause the `download` rule to be run once for each wanted file. I will add an example another day, but you may try meanwhile. – bli Jun 18 '17 at 14:09
  • @user3015703 I added an example for the parallel download. – bli Jun 19 '17 at 09:08