
I have a project in C which I am compiling with gcc. When run, it sometimes segfaults and sometimes doesn't. When run under gdb with `$ gdb -q ./build/program`, it does not show any errors. After doing a bit of research, I found this question on Stack Overflow. Setting `(gdb) set disable-randomization off` does allow me to see the segmentation fault in gdb. However, I have no clue what address space randomization does or where I should look to find the problem. Is there a particular tool or a particular type of construct that I should look into?

I am also including the backtrace here for more context:

(gdb) set disable-randomization off
(gdb) run -n 10 -s 10
Starting program: /path/to/code/build/exact_diag_simulation -n 10 -s 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Exact Diagonalization
---------------------
len: 10  nospin: 0       coupling: 0.00e+00
disorder: 1.00e+01       hopping: 1.00e+00
Starting Simulation for Exact Diagonalization...
Run 1 started...[New Thread 0x7f4e34bff6c0 (LWP 12972)]
[New Thread 0x7f4e343fe6c0 (LWP 12973)]
[New Thread 0x7f4e33bfd6c0 (LWP 12974)]
[New Thread 0x7f4e333fc6c0 (LWP 12975)]
[New Thread 0x7f4e32bfb6c0 (LWP 12976)]
[New Thread 0x7f4e323fa6c0 (LWP 12977)]
[New Thread 0x7f4e31bf96c0 (LWP 12978)]

Thread 1 "exact_diag_simu" received signal SIGSEGV, Segmentation fault.
0x00007f4e75934a02 in zgemv_n_SKYLAKEX () from /usr/lib/libblas.so.3
(gdb) bt
#0  0x00007f4e75934a02 in zgemv_n_SKYLAKEX () from /usr/lib/libblas.so.3
#1  0x00007f4e74ff550c in zgemv_ () from /usr/lib/libblas.so.3
#2  0x00007f4e76772142 in zlatrd_ () from /usr/lib/liblapack.so.3
#3  0x00007f4e766f3ba0 in zhetrd_ () from /usr/lib/liblapack.so.3
#4  0x00007f4e766ea105 in zheev_ () from /usr/lib/liblapack.so.3
#5  0x00007f4e77195e07 in LAPACKE_zheev_work () from /usr/lib/liblapacke.so.3
#6  0x00007f4e77195fba in LAPACKE_zheev () from /usr/lib/liblapacke.so.3
#7  0x00005612184a3287 in utils_get_eigh (matrix=0x7f4e3135c010, size=200, eigvals=0x561218f1a0c0)
    at src/utils/utils.c:273
#8  0x00005612184a27c4 in run (params=0x7fffa9f5ffd0, create_neighbours=1, gfunc=0x7f4e74ee1010)
    at src/exact_diag_simulation.c:155
#9  0x00005612184a2577 in main (argc=5, argv=0x7fffa9f60348) at src/exact_diag_simulation.c:103

I have compiled with the following flags in my makefile:

CFLAGS=-Wall -Wextra -g -fdiagnostics-color=always -fopenmp -ffast-math -fsanitize=address,undefined
LFLAGS=-llapacke -lm -lgsl -lcblas
adch99
  • 4
    A program that sometimes works and sometimes crashes is indicative of *undefined behavior* in the code. I recommend you use the GCC sanitizer (build with `-fsanitize=address,undefined`) and see what it tells you. – Some programmer dude Aug 25 '22 at 08:03
  • 4
    You may also want to compile with `-g` when using the sanitizers. It usually produces spot on messages showing you where bad things happened. – Ted Lyngmo Aug 25 '22 at 08:06
  • 1
    @Someprogrammerdude I did a clean build with `-fsanitize=address,undefined` and then ran the executable. With that build the segmentation fault does not occur at all (I did 15-20 runs of the program and none of them segfaulted, unlike before). There is no output from the sanitizer either. – adch99 Aug 25 '22 at 08:14
  • 2
    GDB shows in its backtrace as "lowest" user function this location: `src/utils/utils.c:273`. Did you check that? Probably one of the arguments to the called `LAPACKE_zheev()` is erroneous, and it is passed down the chain until finally `zgemv_n_SKYLAKEX()` faults. You could insert some assertions (or primitive `printf()` debugging) to investigate further. – the busybee Aug 25 '22 at 08:20
  • Unfortunately, @thebusybee, that doesn't help me much. I have checked the arguments and also written a test for the function (which doesn't segfault). So I am not sure what is wrong. I feel like it might be a concept I'm unfamiliar with. My question here is what can possibly trigger a segfault when address space randomization is turned on. – adch99 Aug 25 '22 at 08:42
  • 1
    @adch99 Can you create a [mre] and put that in your question? Sometimes the optimization level combined with `-g -fsanitize=address,undefined` affects the result. Try with `-O0` first, then `-O3`, to see if that triggers the segfault. – Ted Lyngmo Aug 25 '22 at 08:45
  • 1
    It could be as simple as out-of-bounds accesses. Especially the latter is Undefined Behavior. – the busybee Aug 25 '22 at 08:49

0 Answers