Randomly throwing bus error on slurm cluster

Question

I am running the same executable on an HPC cluster with different input arguments. I usually submit several hundred jobs at once (using job arrays or bash loops). Some jobs suddenly crash with a BUS ERROR message:

/var/spool/slurmd/job58791836/slurm_script: line 193: 3086318 Bus error (core dumped) ./${exec_name}.o "${@:14}" -L $L -J $J -J0 $J0 -g $g -g0 $g0 -h $h -w $wx -th $thread_num -m 1 -r $r -k $k_sym -p $p_sym -x $x_sym -op $operator -fun $fun -s $site -b 0 -ch $ch -seed $seed -jobid $jobid -q_ipr 2.0 1>&${filename}.log

The submitted jobs require at most ~4GB of memory. I allocate at least 12GB per job, yet the error persists.

My code is based on the Armadillo C++ library and I compile it using:

icpx main.cpp XYZ_UI.cpp XYZ_sym.cpp XYZ.cpp -o ${exec_name}.o\
 -pthread -lhdf5 -Wall -Wformat=0 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core\
 -liomp5 -lpthread -lm -ldl -lmkl_sequential -lstdc++fs -fopenmp -std=c++2a -O3 ${compile_suffix} "${@:1}"

My concern is that most nodes on the cluster I use are AMD-based, and the Intel compiler might emit optimized instructions intended for Intel CPUs.

I used valgrind to check for memory leaks, and it only reported "still reachable" blocks, which should not cause any problems for the code. Here is the valgrind output: https://pastebin.com/0mqR6gNx

Is there anything wrong with compiling C++ code with the Intel compiler for AMD CPUs? Is there any other way a bus error can occur, apart from a CPU mismatch or memory allocation problems? Can the bus error be caused by many jobs running the same executable and accessing the same compiled shared libraries?

I reviewed several forum threads, but none seems to apply to my problem:

What is a bus error? Is it different from a segmentation fault?

https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101

EDIT: The program fails immediately. No output from the code is produced before the SIGBUS is triggered (the code normally prints the input parameters first, as you can see in the valgrind output in the pastebin link).

In my experience, SIGBUS on clusters is more likely related to filesystem errors. Lustre likes to cause these issues. On the program side, it could be a [misaligned atomic](https://lwn.net/Articles/911219/). But these are just shots in the dark. You need to attach a debugger and find the operation that causes the signal — Homer512, Mar 10 '23 at 09:35
Thanks, I'll read into that. This might indeed be a Lustre issue. I should contact the cluster support — Qant123, Mar 10 '23 at 09:38
If you can find the core dump, you might see quickly where it came from. https://www.cse.unsw.edu.au/~learn/debugging/modules/gdb_coredumps/ — Homer512, Mar 10 '23 at 10:26
Just saw your edit: If the program fails before it even reaches your main function, it's very likely the file system. That happens because shared libraries are memory-mapped, and the first access fails when the kernel has to read the code over the network. — Homer512, Mar 12 '23 at 09:18
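
To make the mechanism from the last comment concrete, here is a minimal C++ sketch of how touching a memory-mapped file can raise SIGBUS when the kernel cannot supply the backing page. It is not taken from the question's code, and it does not reproduce a Lustre failure. It only imitates one by truncating a local file after mapping it. The names `BusProbe` and `read_mapping_after_truncate` and the `/tmp/sigbus_demo_XXXXXX` template are made up for illustration, and the sketch assumes Linux with a writable `/tmp`.

```cpp
#include <cassert>
#include <csetjmp>
#include <csignal>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo helpers; none of these names come from the question.
static sigjmp_buf g_jump;
static volatile sig_atomic_t g_si_code = 0;

// SIGBUS handler: remember why the signal was raised, then jump back out
// of the faulting read instead of letting the process die.
static void on_sigbus(int, siginfo_t* info, void*) {
    g_si_code = info->si_code;
    siglongjmp(g_jump, 1);
}

struct BusProbe {
    bool caught;   // true if the read raised SIGBUS
    int  si_code;  // si_code reported by the kernel (0 if no signal)
};

// Map one page of a temporary file, shrink the file to zero bytes, then
// read from the mapping. The page no longer has backing data, so the
// kernel cannot satisfy the page fault and delivers SIGBUS.
BusProbe read_mapping_after_truncate() {
    BusProbe result{false, 0};
    const long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char path[] = "/tmp/sigbus_demo_XXXXXX";  // assumed writable location
    int fd = mkstemp(path);
    if (fd < 0) return result;
    unlink(path);  // the file disappears once fd is closed

    if (ftruncate(fd, page) != 0) { close(fd); return result; }
    void* mem = mmap(nullptr, page, PROT_READ, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { close(fd); return result; }

    // Stand-in for "the filesystem can no longer deliver this page".
    if (ftruncate(fd, 0) != 0) { munmap(mem, page); close(fd); return result; }

    struct sigaction sa{}, old{};
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(g_jump, 1) == 0) {
        // First touch of the page: this is where the fault happens.
        volatile char c = *static_cast<volatile char*>(mem);
        (void)c;
    } else {
        // We arrive here via siglongjmp from the handler.
        result.caught = true;
        result.si_code = g_si_code;
    }

    sigaction(SIGBUS, &old, nullptr);  // restore the previous handler
    munmap(mem, page);
    close(fd);
    return result;
}
```

Calling `read_mapping_after_truncate()` from a small `main` should return `caught == true`. The parallel to the question is that a shared library is mapped the same way, so if the kernel fails to read one of its pages from the network filesystem, the first instruction that touches that page would be expected to fail like this, possibly before `main` is reached.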
