I have a situation where my workflow outputs a main directory, which I emit from a process using DSL2. I feed this output to a Python script, which loops over the subdirectories and their files, pulls out information, and compiles it into a .tsv file.
Two important pieces of information the Python script collects are the name of each subdirectory and which file within that subdirectory is actually important.
I would like to take my process output (the root directory) + subdirectory (from the file) + important filename (from the file) and combine them into a new channel of paths to feed to another process.
Am I just using a bad method? Is there a better way to access the contents of a channel? In the documentation I saw subscribe, but I haven't had any luck with it. Thank you in advance.
Example .tsv file (columns 1 and 3 are what I want to append to the path in the channel):
GCF_000005845.2 Escherichia coli str. K-12 substr. MG1655, complete genome GCF_000005845.2_ASM584v2_genomic.fna
GCF_000008865.2 Escherichia coli O157:H7 str. Sakai DNA, complete genome GCF_000008865.2_ASM886v2_genomic.fna
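To make the goal concrete, the path composition described above can be sketched in plain Python (not Nextflow). This is only an illustration under assumptions: the function name genome_paths and its root argument are hypothetical, and the file is assumed to be tab-separated with the subdirectory in column 1 and the filename in column 3.

```python
import csv
from pathlib import Path


def genome_paths(root, tsv_path):
    """Join root + column 1 (subdirectory) + column 3 (filename) for each row."""
    paths = []
    with open(tsv_path, newline="") as handle:
        # QUOTE_NONE: treat quote characters in organism names as plain text
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            paths.append(Path(root) / row[0] / row[2])
    return paths
```

With the two rows above and the data directory as root, this would give data/GCF_000005845.2/GCF_000005845.2_ASM584v2_genomic.fna and data/GCF_000008865.2/GCF_000008865.2_ASM886v2_genomic.fna, which is the list I would like to end up with as a channel.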
Work directory structure:

    ├── c6
    │   └── 6598d4838f61d0421f03216990465c
    │       ├── ecoli
    │       │   ├── README.md
    │       │   └── ncbi_dataset
    │       │       ├── data
    │       │       │   ├── GCF_000005845.2
    │       │       │   │   ├── GCF_000005845.2_ASM584v2_genomic.fna
    │       │       │   │   ├── genomic.gff
    │       │       │   │   ├── protein.faa
    │       │       │   │   └── sequence_report.jsonl
    │       │       │   ├── GCF_000008865.2
    │       │       │   │   ├── GCF_000008865.2_ASM886v2_genomic.fna
    │       │       │   │   ├── genomic.gff
    │       │       │   │   ├── protein.faa
    │       │       │   │   └── sequence_report.jsonl
    │       │       │   ├── assembly_data_report.jsonl
    │       │       │   └── dataset_catalog.json
    │       │       └── fetch.txt
Here is my Nextflow script (constructive criticism very welcome):

    #!/usr/bin/env nextflow
    nextflow.enable.dsl=2

    workflow {
        //ref_genome_ch = Channel.fromPath("$params.ref_genome")
        println([params.taxon, params.zipName, params.unzippedDir])

        DOWNLOAD_ZIP(params.taxon, params.zipName)
        UNZIP(DOWNLOAD_ZIP.out.zipFile)
        REHYDRATE(UNZIP.out.unzippedDir)
        COLLECT_NAMES(REHYDRATE.out.dataDir)

        // I want to get the dir name and file name out of
        // relations.txt
        //thing = Channel.from( )
        //thing.view()
        //organism_genomes = REHYDRATE.out.dataDir.subscribe { println("$it/")}
    }

    // Download a dehydrated genome package for the given taxon
    process DOWNLOAD_ZIP {
        errorStrategy 'ignore'

        input:
        val taxonName
        val zipName

        output:
        path "${zipName}", emit: zipFile

        script:
        def reference = params.reference
        """
        datasets download genome \\
            taxon '${taxonName}' \\
            --dehydrated \\
            --filename ${zipName} \\
            ${reference} \\
            --exclude-genomic-cds
        """
    }

    // Unzip the package into a directory named after the zip file
    process UNZIP {
        input:
        path zipFile

        output:
        path "${zipFile.baseName}", emit: unzippedDir

        script:
        """
        unzip $zipFile -d ${zipFile.baseName}
        """
    }

    // Fetch the actual data files and emit the data directory
    process REHYDRATE {
        input:
        path unzippedDir

        output:
        path "$unzippedDir/ncbi_dataset/data", emit: dataDir

        script:
        """
        datasets rehydrate \\
            --directory $unzippedDir
        """
    }

    // Compile subdirectory names and important filenames into relations.txt
    process COLLECT_NAMES {
        publishDir params.results

        input:
        path dataDir

        output:
        path "relations.txt", emit: org_names

        script:
        """
        python "$baseDir/bin/collect_org_names.py" $dataDir
        """
    }
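For context, here is a minimal sketch of the kind of script that COLLECT_NAMES invokes. It is not the actual collect_org_names.py: the function names are made up for illustration, and it assumes that each accession subdirectory holds one *_genomic.fna file whose first header line has the form ">ACCESSION description".

```python
import sys
from pathlib import Path


def collect_relations(data_dir):
    """Return (subdirectory, description, FASTA filename) tuples for data_dir."""
    rows = []
    for subdir in sorted(Path(data_dir).iterdir()):
        if not subdir.is_dir():
            continue  # skip plain files such as dataset_catalog.json
        fastas = sorted(subdir.glob("*_genomic.fna"))
        if not fastas:
            continue  # no genomic FASTA in this subdirectory
        fasta = fastas[0]
        with fasta.open() as handle:
            header = handle.readline().strip()
        # Assumption: everything after the first space is the description
        description = header.lstrip(">").partition(" ")[2]
        rows.append((subdir.name, description, fasta.name))
    return rows


def write_tsv(rows, out_path="relations.txt"):
    """Write one tab-separated line per row."""
    with open(out_path, "w") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    write_tsv(collect_relations(sys.argv[1]))
```

The point is only that relations.txt ends up with the subdirectory name in the first column and the filename in the last one, so everything needed to rebuild the full paths is in that file plus the data directory.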
Edit: User @Steve recommended channel operators. I don't fully understand the Groovy {thing -> stuff} closure syntax yet, but I tried this:

    thing = REHYDRATE.out.dataDir.map{"$it/*"}
    thing.view()

and I get

    /mnt/c/Users/mkozubov/Desktop/nextflow_tutorial/tRNA_stuff/work/d0/long_hash/ecoli/ncbi_dataset/data/*

printed. But when I feed this into a process that just has a script: println(input), I get an error saying that the command executed is null, the command output is (empty), and that target '*' is not a directory.
My question is: why didn't the .map operator expand the * the way entering "PATH/*" directly into a channel would have?
Edit2: I feel like I almost had something. I changed the output of the COLLECT_NAMES script so that it also contains the path to each file. I now want to parse this file and read its contents into a channel. For that I did:

    organism_genome_files = Channel.from()
    COLLECT_NAMES.out.org_names.map {
        new File(it.toString()).eachLine { line ->
            organism_genome_files << line.split('\t')[3]
        }
    }

If I replace

    organism_genome_files << line.split('\t')[3]

with

    println line.split('\t')[3]

I can see the content I want, but I can't find a way to pull this information out.
I also tried it with organism_genome_files as a list, but nothing seems to work. I just can't seem to pull information out of a channel and transform it.
The .splitCsv() operator seems like it could be useful, but I still don't understand how to get a channel to work as an input to another channel :(