I'm trying to run HPL Linpack on my personal laptop, using CentOS 8 in a VM.
Allocated cores : 6
Memory : 12.5 GB
Nodes : 1
With smaller values of N it runs fine, but when I try to maximise CPU usage with larger values of N (trying to reach 75-80% usage), I get a different error each time.
ERRORS - each of the following appeared on a separate run.
[1617771807.179752] [localhost:3301 :0] sock.c:344 UCX ERROR recv(fd=28) failed: Bad address
[1617771807.188129] [localhost:3298 :0] sock.c:344 UCX ERROR recv(fd=27) failed: Connection reset by peer
[1617771807.249456] [localhost:3298 :0] sock.c:344 UCX ERROR sendv(fd=-1) failed: Bad file descriptor
[localhost:03298] *** An error occurred in MPI_Send
[localhost:03298] *** reported by process [3696427009,2]
[localhost:03298] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
[localhost:03298] *** MPI_ERR_OTHER: known error not in list
[localhost:03298] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[localhost:03298] *** and potentially your MPI job)
_________________________________________________________________________________________
malloc(): corrupted top size
[localhost:06009] *** Process received signal ***
[localhost:06009] Signal: Aborted (6)
[localhost:06009] Signal code: (-6)
[localhost:06009] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f230e65cb20]
[localhost:06009] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f230e2be7ff]
[localhost:06009] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f230e2a8c35]
[localhost:06009] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f230e301987]
[localhost:06009] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f230e308d8c]
[localhost:06009] [ 5] /lib64/libc.so.6(+0x851f5)[0x7f230e30c1f5]
[localhost:06009] [ 6] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f230e30d412]
[localhost:06009] [ 7] ./xhpl[0x4232e3]
[localhost:06009] [ 8] ./xhpl[0x4202cd]
[localhost:06009] [ 9] ./xhpl[0x41168e]
[localhost:06009] [10] ./xhpl[0x408eff]
[localhost:06009] [11] ./xhpl[0x4018aa]
[localhost:06009] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f230e2aa7b3]
[localhost:06009] [13] ./xhpl[0x401cae]
[localhost:06009] *** End of error message ***
_________________________________________________________________________________________
corrupted size vs. prev_size
[localhost:05847] *** Process received signal ***
[localhost:05847] Signal: Aborted (6)
[localhost:05847] Signal code: (-6)
[localhost:05847] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f07c812eb20]
[localhost:05847] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f07c7d907ff]
[localhost:05847] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f07c7d7ac35]
[localhost:05847] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f07c7dd3987]
[localhost:05847] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f07c7ddad8c]
[localhost:05847] [ 5] /lib64/libc.so.6(+0x825e6)[0x7f07c7ddb5e6]
[localhost:05847] [ 6] /lib64/libc.so.6(+0x83a1b)[0x7f07c7ddca1b]
[localhost:05847] [ 7] ./xhpl[0x423596]
[localhost:05847] [ 8] ./xhpl[0x4202a6]
[localhost:05847] [ 9] ./xhpl[0x41168e]
[localhost:05847] [10] ./xhpl[0x408eff]
[localhost:05847] [11] ./xhpl[0x4018aa]
[localhost:05847] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f07c7d7c7b3]
[localhost:05847] [13] ./xhpl[0x401cae]
[localhost:05847] *** End of error message ***
I'm calculating N using this formula:
N = int(round(sqrt((memory_per_node * 1024 * 1024 * 1024 * nodes) / 8)) * percentage_usage)
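For reference, here is a minimal Python sketch of the formula above, so the resulting N and the size of the matrix it implies can be checked by hand. The function names hpl_problem_size and matrix_bytes are made up for illustration, and the sketch assumes memory_per_node is given in GiB and that each matrix element is an 8-byte double.

```python
import math


def hpl_problem_size(memory_per_node_gib, nodes, percentage_usage):
    """Evaluate the formula from the post (hypothetical helper name).

    memory_per_node_gib: memory per node, assumed to be in GiB
    nodes:               number of nodes
    percentage_usage:    fraction applied to N, e.g. 0.8
    """
    total_bytes = memory_per_node_gib * 1024 * 1024 * 1024 * nodes
    # Largest N such that an N x N matrix of 8-byte doubles fits in total_bytes.
    n_max = round(math.sqrt(total_bytes / 8))
    # The formula scales N itself (not the memory) by percentage_usage.
    return int(n_max * percentage_usage)


def matrix_bytes(n):
    """Bytes needed for an n x n matrix of 8-byte doubles."""
    return n * n * 8


if __name__ == "__main__":
    n = hpl_problem_size(12.5, 1, 0.8)
    print("N =", n)
    print("matrix size in GiB =", matrix_bytes(n) / 1024**3)
```

With the values from my setup (12.5 GB, 1 node, percentage_usage = 0.8) this gives N = 32768, and the matrix alone takes 8 GiB. The calculation uses the full VM memory as input and does not subtract anything for the OS or for the MPI processes themselves.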