I have just plunged into Snakemake pipeline building, so maybe I am missing something crucial, but to the point:
I've assembled a few steps and tested them locally on a dummy input with success.
Then I tried to run the pipeline on a cluster through qsub, but that seems to fail (this is mostly me not having read enough about cluster execution, I am sure). BUT I can run the pipeline on a cluster node in an interactive session.
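For reference, the invocations look roughly like this (simplified from memory; the exact qsub resource flags depend on the scheduler, so take them as placeholders):

# local run on the dummy input: works
snakemake --cores 8

# cluster submission: fails
snakemake --jobs 50 --cluster "qsub -cwd -V"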
The previous step in the pipeline generates k-mer counts. The step below then combines all the input files into a single tabular file by means of R's map() function and some tidyverse transformations.
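For context, the combining script does roughly the following (a simplified sketch of scripts/combine_full_seq_kmc_output.R; the column names are placeholders, not the real ones):

# Simplified sketch -- the real script differs in details;
# the column names below are placeholders.
library(tidyverse)

args    <- commandArgs(trailingOnly = TRUE)
out_dir <- args[1]
kmer    <- args[2]
origin  <- args[3]
files   <- args[-(1:3)]  # the per-sample .kms paths passed via {input}

# read every per-sample k-mer count file and join into one wide table
combined <- files %>%
  map(~ read_tsv(.x, col_names = c("kmer_seq", "count"))) %>%
  set_names(basename(files)) %>%
  bind_rows(.id = "sample") %>%
  pivot_wider(names_from = sample, values_from = count)

write_tsv(combined, file.path(out_dir, paste0(kmer, "_mer_kmc_agg.tsv")))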
The dummy case, where I just truncate the input to sample_PH=PH_list[:20], executes fine: the R session is started and the STDOUT messages are generated.
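Concretely, the only change for the dummy run is the slice inside the rule's expand():

# dummy case: take only the first 20 samples
expand(
    "1_data/kma_clustering_{sim}/01_kmc/PATRIC_phage/seq_whole/{kmer}_mer/{sample_PH}.{kmer}_mer.kms",
    sample_PH=PH_list[:20], kmer=KMERS, sim=SIMS)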
But giving the full list (8k+ samples) fails with a "non-zero exit status" error.
I am not sure why I'm seeing this behaviour. The total size of the input files for this step is 5 GB, while my interactive node has a much higher memory capacity of 200 GB, so memory alone should not be the bottleneck.
Does anyone have experience with what might be causing this error? I don't know where to start.
Thank you in advance.
rule join_PH_whole:
    input:
        expand(
            "1_data/kma_clustering_{sim}/01_kmc/PATRIC_phage/seq_whole/{kmer}_mer/{sample_PH}.{kmer}_mer.kms",
            sample_PH=PH_list, kmer=KMERS, sim=SIMS)
    output:
        "1_data/kma_clustering_{sim}/02_kmc_joints/PATRIC_phage/seq_whole/{kmer}_mer_kmc_agg.tsv"
    params:
        out_dir = "1_data/kma_clustering_{sim}/02_kmc_joints/PATRIC_phage/seq_whole",
        kmer = "{kmer}",
        origin = "-1"
    shell:
        "Rscript scripts/combine_full_seq_kmc_output.R {params.out_dir} {params.kmer} {params.origin} {input}"