
I am using Nextflow.io to schedule several thousand analysis jobs and then join the outputs.

Nextflow is a DSL that lets me specify channels and processes and then schedule and run them. Under the hood, it creates a bash script for each process, which is why I'm posting here rather than at https://github.com/nextflow-io/nextflow.

I can provide the full version of the script, but this is a cut-down version:

#!/bin/bash

nxf_stage() {
    true
    #THIS IS WHERE IT BREAKS
    ...
}    

nxf_main() {
    trap on_exit EXIT
    trap on_term TERM INT USR1 USR2
    # some more prep here
    nxf_stage
    ...
    wait $pid || nxf_main_ret=$?
    ...
    nxf_unstage
}

$NXF_ENTRY

The purpose of the `nxf_stage` function is to prepare the files that the process needs. In place of the comment above marking where it breaks are approximately 76,000 lines like this:

rm -f result_job_073241-D_RGB_3D_3D_side_far_0_2019-03-12_03-25-01.json

followed by the same number of lines like this:

ln -s /home/ubuntu/plantcv-pipeline/work/8d/ffe3d29ee581c09d3d25706c238d1d/result_job_073241-D_RGB_3D_3D_side_far_0_2019-03-12_03-25-01.json result_job_073241-D_RGB_3D_3D_side_far_0_2019-03-12_03-25-01.json

When I try to execute the Nextflow script, I get this error:

Segmentation fault (core dumped)

I was able to narrow it down to that function just by putting echo statements on either side of it, but nothing in that function looks complicated to me. Indeed, when I stripped everything else away and left the script as nothing but the ~152,000 `rm` and `ln` lines, it worked.

Is it possible that a function of this size has a memory footprint large enough to cause the segfault? Each individual command seems small.
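To test the idea, here is a minimal sketch I could imagine running (the file name and command count are made up, not from my pipeline): it builds a script whose single function body contains tens of thousands of trivial commands, then runs it under different soft stack limits.

```shell
# Generate a toy script: one function holding n one-command lines.
n=50000
{
  echo 'big_fn() {'
  yes '  true' | head -n "$n"   # n trivial commands inside the function
  echo '}'
  echo 'big_fn'
  echo 'echo OK'
} > /tmp/big_fn.sh

# With a deliberately tiny soft stack limit, calling the function can
# crash bash itself; raising the soft limit to the hard maximum lets
# the same script finish.
( ulimit -sS 1024; bash /tmp/big_fn.sh ) || echo "failed under a 1 MiB stack"
( ulimit -sS hard; bash /tmp/big_fn.sh ) || echo "still failing; the hard limit may also be too low"
```

If the first invocation segfaults while the second succeeds, that would point at the interpreter's own stack usage rather than at any individual command.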

Update:

Output of `bash -x`:

+ set -x
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ NXF_ENTRY=nxf_main
+ nxf_main
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
++ dd bs=18 count=1 if=/dev/urandom
++ base64
++ tr +/ 0A
+ export NXF_BOXID=nxf-1qYK72XftztQW4ocxx3Fs1tC
+ NXF_BOXID=nxf-1qYK72XftztQW4ocxx3Fs1tC
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /home/ubuntu/plantcv-pipeline/work/ec/da7ca4e909b2cc4a74ed8963cc5feb/.command.begin
+ set +u
+ set -u
+ [[ -n '' ]]
+ nxf_stage
Segmentation fault (core dumped)
  • A script is nothing more than sequential commands, so I think Nextflow somehow causes the segmentation fault. Can you try giving the JVM more memory with `-Xmx`? – Bayou Nov 20 '19 at 07:25
  • I don't _think_ Nextflow is actually running anything when I execute the `.command.run` script that it produces manually. Nextflow produces that script and then kicks it off; you can run it manually to debug. – George Nov 20 '19 at 08:34
  • Does the script produce a segmentation fault when you run it manually as `./yourscript.sh`? – Bayou Nov 20 '19 at 10:33
  • @George: For debugging, don't put `echo` at arbitrary points; run the script with `bash -x` instead. Alternatively, put `set -x` as the first command of the script. – user1934428 Nov 20 '19 at 11:42
  • What happens when you remove a few thousand of those jobs? – Beta Nov 20 '19 at 19:18
  • When I removed most of the lines (leaving just 10 of each `rm` and `ln`), it ran successfully. – George Nov 20 '19 at 23:44
  • Added the `bash -x` output in an update. – George Nov 20 '19 at 23:46
  • It looks like it might be related to the maximum stack size (`ulimit`). – George Nov 21 '19 at 05:00

1 Answer


I came upon the solution here: https://stackoverflow.com/a/14471782/5447556

Essentially, executing the function was consuming stack space and hitting the soft stack size limit.

Running `ulimit -sS unlimited` allowed me to increase this limit.
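For anyone checking their own system, the soft and hard stack limits can be inspected and raised like this (the printed values vary by system):

```shell
# -S: soft limit, which any process may raise up to the hard limit
# -H: hard limit, which only root can raise
ulimit -Ss
ulimit -Hs

# Raise the soft limit to the hard maximum for the current shell only:
ulimit -sS hard
```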

Adding `beforeScript: 'ulimit -sS unlimited'` to my Nextflow process was a successful workaround. It's worth noting that this won't work in extreme cases, and is quite clunky. I think the real solution will be for Nextflow to implement a more robust staging process for large input channels.
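For illustration, the directive sits alongside the rest of the process definition roughly like this (the process name, channel, and script body are placeholders, not my real pipeline):

```nextflow
process analyse {
    // Runs before .command.run, so the staging function inherits the limit
    beforeScript 'ulimit -sS unlimited'

    input:
    file result_json from results_ch

    script:
    """
    your_tool.py ${result_json}
    """
}
```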

  • `ulimit -s unlimited` was the solution for me; the `-s` modifier was also what was used in the linked solution. – Maarten Sep 25 '21 at 13:52