I am working with a Nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), resulting in a channel that looks like this:

[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]

Note that this is the same channel structure that you get from .fromFilePairs().

I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.

The Nextflow documentation is somewhat cryptic about this, and it is hard to find out how to declare this type of input in a process, so I thought I'd create a question on Stack Overflow and then answer it myself for anyone who ends up looking for this answer.

How does one declare the inner tuple in the input section of a Nextflow process?

schmat_90

3 Answers

The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:

The normal file input constructs introduced in the input of files section are valid for collections of multiple files as well.


This means you can use a script variable, e.g.:

input:
tuple val(sample_id), path(my_files)

In this case, the variable will hold the list of files (preserving the original filenames). You can use it directly to refer to all of the files in the list, or you can access specific (file) elements, if you need them, using square bracket (slice) notation.
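As a rough sketch of both uses (the process name `count_lines` and the shell commands are only illustrative, and this assumes every channel item carries more than one file, so that `my_files` really is a list):

```nextflow
process count_lines {

    tag { sample_id }

    debug true

    input:
    tuple val(sample_id), path(my_files)

    script:
    """
    # The whole list is rendered as space-separated filenames
    wc -l ${my_files}

    # A single element is picked with square brackets (first file here)
    head -n 1 ${my_files[0]}
    """
}

workflow {

    // Illustrative input: pairs such as foo_1.fastq / foo_2.fastq
    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )

    count_lines( readgroups )
}
```

If an item might contain only a single file, indexing with `[0]` may not behave like this, so treat the second command as an assumption about your data.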

This is the syntax you will want most of the time. However, if you need predictable filenames, or if you need to deal with files that have identical filenames, you may need a different approach: you could instead specify a target filename, e.g.:

input:
tuple val(sample_id), path('my_file')

When a single file is received by the process, the file is staged with the target filename. However, when a collection of files is received, a numerical suffix representing each file's ordinal position in the list is appended to the target filename. For example:

process test {

    tag { sample_id }

    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_id), path('fastq')

    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}

workflow {

    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
    
    test( readgroups )
}

Results:

$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run . 
N E X T F L O W  ~  version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor >  local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20  fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20  fastq2 -> ../../../baz_2.fastq

foo:
lrwxrwxrwx 1 users 20  fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20  fastq2 -> ../../../foo_2.fastq

bar:
lrwxrwxrwx 1 users 20  fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20  fastq2 -> ../../../bar_2.fastq

Note that the names of staged files can be controlled using the * and ? wildcards. The multiple input files section of the docs has a table that shows how the wildcards are replaced depending on the cardinality of the input collection.
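A minimal sketch of the wildcard form, assuming the same `readgroups` channel as above (the process name `test_wildcard` and the `reads_?.fastq` pattern are made up for illustration). As far as I understand the table in the docs, each `?` is replaced by the file's position in the list, so a pair should be staged as `reads_1.fastq` and `reads_2.fastq`:

```nextflow
process test_wildcard {

    tag { sample_id }

    debug true

    input:
    // '?' stands in for the ordinal position of each file in the list
    tuple val(sample_id), path('reads_?.fastq')

    script:
    """
    echo "${sample_id}:"
    ls reads_*.fastq
    """
}

workflow {

    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )

    test_wildcard( readgroups )
}
```

This keeps the file extension in the staged name, which some tools rely on, while still giving every task the same predictable filenames.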

Steve
  • How does one iterate over these files after catching them in variable `path(my_files)`? In this case: `[file_A, file_B, ... , file_N]` iterate in the same order – Death Metal Jul 19 '23 at 17:51
  • @DeathMetal It depends on what you want to do with the collection exactly, but if you want to get back a collection after some transformation (specified using a closure), you can use [`collect`](https://docs.groovy-lang.org/docs/groovy-1.7.3/html/groovy-jdk/java/util/Collection.html#collect) – Steve Jul 20 '23 at 05:59
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:

input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel

This ensures that every element of the inner tuple is read as a file, much like performing .collect() on a channel of files: all of the files will then be available in the Nextflow work directory where the process is executed.

schmat_90
  • Using a script variable only half answers the question. You could also use a target filename and the `*` or `?` wildcards to rewrite the input filenames if required. The normal file input constructs are also valid for collections of files. – Steve Oct 06 '22 at 15:18
The question is how you come up with sample_id, but if the files for a sample differ only in their file extensions, you might use something like this:

all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
         .groupTuple()
         .set { grouped_files }

Patrick H.
  • what does `it -> [it.simpleName, it]` do? – Death Metal Aug 03 '23 at 22:30
  • Hi. Apparently I misunderstood the OP. I thought he wanted to know how to create this structure, but he wanted to know how to process it. Anyway: that line creates a tuple from each file element of the channel, with the "simpleName" (the basename without file extensions) at index 0 and the file object at index 1. – Patrick H. Aug 08 '23 at 07:54