Nextflow combine by regex match

Question

I have a tuple channel containing entries like:

SH7794_SA119138_S1_L001, [R1.fq.gz, R2.fq.gz]

And a csv split into 36 entries, each like:

[samplename:SH7794_SA119138_S1, mouseID:1-4, treat:vehicle, dose:NA, time:day18, tgroup:vehicle__day18, fastqsuffix:_L001_R1_001.fastq.gz, bamsuffix:_Filtered.bam, trim:fulgentTrim, species:human, host:mouse, outlier:NA, RIN:6.9]

I was able to combine the tuple channel with the csv entries using the each keyword. This results in a cross-product of all 36 csv rows for each tuple. I then added a when condition to do the filtering:

  input:
    tuple sampleid, reads from fq
    each samplemeta from samplelist

  ...

  when:
    sampleid.contains(samplemeta.samplename)

This works but I'm curious if this is an appropriate solution. What is the correct way to dynamically join channels using a regular expression, by matching a value from one channel against multiple values from a second channel?

Steve · Accepted Answer · 2022-02-09T09:44:52.440

I tend to avoid using the each qualifier like this because of this recommendation in the docs:

If you need to repeat the execution of a process over n-tuple of elements instead a simple values or files, create a channel combining the input values as needed to trigger the process execution multiple times. In this regard, see the combine, cross and phase operators.

I don't actually think there's a way to join channels using a regex, but what you can do is use the combine operator to produce the Cartesian product of the items emitted by two channels. And if you supply the by parameter, you can combine the items that share a common matching key. For example, untested:

params.reads = '/path/to/fastq/*_{,L00?}_R{1,2}.fq.gz'


Channel
    .fromPath('sample_list.csv')
    .splitCsv(header: true)
    .map { row -> tuple( row.samplename, row ) } 
    .set { sample_metadata }

Channel
    .fromFilePairs( params.reads )
    .combine( sample_metadata, by: 0 )
    .set { test_inputs }


process test {

    input:
    tuple val(sample_id), path(reads), val(metadata) from test_inputs

    script:
    def (fq1, fq2) = reads

    """
    echo "sample_id: ${sample_id}"
    echo "reads: ${fq1}, ${fq2}"
    echo "metadata: ${metadata}"
    """
}
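
If the IDs derived from the FASTQ file names differ from the samplename column only by a known suffix, one option is to normalize the key in a map closure before combining. The following is a minimal, untested sketch in the same DSL1 style as above. `params.id_suffix` is a hypothetical parameter introduced here for illustration, and the glob is an assumed one that leaves the lane suffix in the ID:

```groovy
// Sketch only (DSL1). `params.id_suffix` is a hypothetical parameter, not a Nextflow built-in.
params.reads     = '/path/to/fastq/*_R{1,2}.fq.gz'   // assumed glob; the ID keeps e.g. `_L001`
params.id_suffix = '_L00\\d$'                        // regex for the suffix to strip; override per project

Channel
    .fromPath('sample_list.csv')
    .splitCsv(header: true)
    .map { row -> tuple( row.samplename, row ) }
    .set { sample_metadata }

Channel
    .fromFilePairs( params.reads )
    .map { sample_id, reads ->
        // String.replaceAll takes a regex; IDs without the suffix pass through unchanged
        tuple( sample_id.replaceAll(params.id_suffix, ''), reads )
    }
    .combine( sample_metadata, by: 0 )
    .set { test_inputs }
```

Because the pattern is a parameter, it can be changed per project without editing the script, for example with `--id_suffix '_S\d+_L00\d$'` on the command line. When the sample IDs are already identical in both channels, the pattern does not match and the key is left as it is.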

This is quite elegant; however, the regex in the map closure, `/_L001$/`, would have to be dynamic, as sample naming varies between projects and vendors. In some cases, sampleids are identical in both channels. If it is possible to regulate sample-naming conventions, this will work well. — varontron, Feb 09 '22 at 04:37
@varontron So that regex in the map closure would just remove that suffix if it exists, and should handle the case where the sample names are identical in both channels. This logic is not necessary though - it was just the first solution that came to mind when I read your question. Unless I have misunderstood, you could actually parameterize the input regex. Please see my updated answer above. — Steve, Feb 09 '22 at 09:43
@varontron If you'd like to default to a pattern that covers most of the common FASTQ filenames, you may like to use [this regex](https://stackoverflow.com/a/70905944/751863) instead. — Steve, Feb 09 '22 at 09:50
