
The following code uses posix_memalign to allocate four buffers per core, 128 bytes each. It succeeds on a Skylake but fails on a Broadwell (an earlier generation) with the following message:

posix memalign malloc.c:2401: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed

According to the Linux man pages, posix_memalign fails in two cases: (1) the alignment argument is not a power of two, or is not a multiple of sizeof(void *); or (2) there is insufficient memory to fulfill the allocation request. Since the call generates a segmentation fault, I can't get an error number out of rax. The Broadwell has 8 GB of memory, so insufficient memory is not the problem. The alignment is 64, so that's not the problem either, and in any case the call succeeds on my Skylake, so it's written correctly.
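For reference, posix_memalign reports errors through its return value rather than through errno, so on a normal return the status is in eax; a minimal check might look like this (a sketch, with a hypothetical .alloc_failed label):

call posix_memalign wrt ..plt
test eax,eax          ; posix_memalign returns 0 on success
jnz .alloc_failed     ; nonzero is the error code itself: EINVAL or ENOMEM

In this case the fault happens inside the call, so no return value is ever produced.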

Here's the relevant code block:

mov rax,2 ; number of cores
mov rbx,4 ; number of buffers to create
mul rbx
mov r14,rax ; total number of buffers (cores x buffers per core)
xor r13,r13 ; buffer index for the loop below

Memalign_Create_Loop:
mov rax,r15 ; number of cores
mov rbx,r12 ; number of buffers needed
 ; N ptrs per core x 8 bytes per pointer
mul rbx ; times the number of cores
mov r12,rax ; number of buffers needed x number of cores
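; note: r12 is rescaled by r15 on every pass through this loop,
; so the computed count grows each iteration when r15 > 1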
lea rdi,[memalign_pointer]
mov rsi,64 ; alignment
mov rdx,r12
shl rdx,3
mov rdx,128 ; buffer size
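; note: this overwrites the size just computed into rdx,
; so the two instructions above are dead code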
;xor rax,rax    
sub rsp,40
call posix_memalign wrt ..plt
add rsp,40
lea r8,[internal_buffer_pointers]
lea rdi,[memalign_pointer]
mov rax,[rdi]
mov rbx,r13
shl rbx,3
mov [r8+rbx],rax
add r13,1
cmp r13,r14
jl Memalign_Create_Loop

It fails at "call posix_memalign wrt ..plt" and displays the error message shown above, along with a segmentation fault message.

It's a puzzle to me because it succeeds on the Skylake, and posix_memalign predates Broadwell.

I assemble and link with:

sudo nasm -f elf64 -g -F dwarf NWL.asm

sudo ld -shared NWL.o /opt/P01_SH/_Library/Create_Threads_in_C-NWL.o /opt/P01_SH/_Library/Timer_for_NASM.o /opt/P01_SH/_Library/POSIX_Shared_Memory.o /opt/P01_SH/_Library/PThread_Mutex.o /opt/P01_SH/_Library/Create_Multi_Files.o -ldl -lrt -lpthread -lm -o NWL.so

Thanks for any ideas on this.

RTC222
  • It seems really unlikely that CPU family would be the determining characteristic. What OS versions are the two machines using? Also, what command are you using to assemble and link? AFAIK the plt isn't usually needed on x86-64; does it work if you just `call posix_memalign`? – Nate Eldredge Sep 30 '20 at 21:33
  • I use the procedure linkage table all the time, but your suggestion of a call without it makes sense because that could be where the seg fault is coming from (including the plt in the call). I updated my question with the assemble and link strings (NASM compiler and ld to link). To query the Linux versions and try the call without the plt, it's a shared server and I can't get access for about 30 minutes. After that I can update here with the answers. Thanks for your comments. – RTC222 Sep 30 '20 at 21:52
  • Both systems are Ubuntu 18.04.4 LTS. The server that succeeds is Linux kernel 4.15.0-58-generic and the one that fails is 4.15.0-118-generic. When I eliminate the reference to plt (simply call posix_memalign) I get "ld: NWL.o: relocation R_X86_64_PC32 against undefined symbol `posix_memalign' can not be used when making a shared object; recompile with -fPIC ld: final link failed: Bad value." So I need the plt reference because this is a shared object. – RTC222 Sep 30 '20 at 22:26
  • The assertion message seems to indicate a case of heap corruption. Then the actual bug would be in earlier code, the corruption is only detected once your function tries another allocation. Both CPUs probably implement different vector instruction sets, and many heavily optimised libraries have code specially tuned for each CPU family, chosen at runtime. This might cause heap allocations of different sizes, with the corruption only hitting one of the cases. – kisch Sep 30 '20 at 22:52
  • That seems to suggest that for allocating small internal buffers I should stick to malloc. I wanted alignment but if the price is uneven performance then for small buffers posix_memalign may not be reliable enough. – RTC222 Sep 30 '20 at 22:58
  • Just prior to the call for this code, I allocate a large posix shared memory buffer. That may be the source of this problem. I'll work on it from that angle. – RTC222 Sep 30 '20 at 23:01
  • Good point @kisch. Have you tried the usual techniques for debugging malloc crashes, e.g. valgrind (a sample invocation is sketched just after these comments)? `posix_memalign` is in the same boat as `malloc`, it's not really material here which one you use. This is very likely a simple bug where you just wrote off the end of a block of memory somewhere, or used memory after freeing it, and nothing to do with esoteric details of CPU architecture or whether you allocate large or small blocks. – Nate Eldredge Sep 30 '20 at 23:09
  • With the comments from Nate Eldredge and @kisch I know where to look. When I find the error I will post back here. – RTC222 Sep 30 '20 at 23:25
  • @NateEldredge: You do need explicit PLT or GOT to call library functions if you want to link into a PIE executable (or PIC library). PLT generation by `ld` only happens when linking into a non-PIE executable. [Can't call C standard library function on 64-bit Linux from assembly (yasm) code](https://stackoverflow.com/a/52131094) and the related [Unexpected value of a function pointer local variable](https://stackoverflow.com/q/56760086) – Peter Cordes Oct 01 '20 at 00:45
  • @kisch: Skylake didn't add any new SIMD extensions vs. Broadwell, unless the OP is actually talking about Skylake-X (skylake-avx512). But even so, I wouldn't expect `posix_memalign` or the simpler C11 `aligned_alloc` to work differently. But it's certainly possible that different code (AVX512 vs. AVX2) in the OP's program has different bugs, leading to heap corruption in one but not the other via an out-of-bounds array write. – Peter Cordes Oct 01 '20 at 00:54
  • @PeterCordes: thanks for the clarification, I didn't compare the CPUs before commenting. This makes a different heap usage pattern from tuned libraries somewhat unlikely. In that case I'd look for maybe a different number of cores, which could hit a multithreading bug. Especially since the posted code explicitly refers to the number of cores. – kisch Oct 01 '20 at 01:11
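As a sketch of the valgrind suggestion above: since NWL.so is a shared object, it has to be exercised through whatever executable loads it (host_app below is a hypothetical stand-in for that host program):

valgrind --tool=memcheck --track-origins=yes ./host_app

memcheck reports out-of-bounds writes and use-after-free at the point they occur, rather than at the next allocation where glibc's assertion fires.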

1 Answer


I solved this problem working from the comment above by kisch ("The assertion message seems to indicate a case of heap corruption. Then the actual bug would be in earlier code, the corruption is only detected once your function tries another allocation").

My NASM code %includes two files containing memory-allocation code. Before the fix, the order was:

; Create buffers to hold POSIX shared memory: 
mov rbx,2 ; 2 output buffers per core
%include "/opt/P01_SH/_Include_Utilities/POSIX_Shared_Memory.asm"

mov r15,1 ;[Number_Of_Cores_Seq]
mov r12,4 ; number of internal_buffers needed
%include "/opt/P01_SH/_Include_Utilities/Posix_Memalign_Internal.asm"

To fix this, I reversed the order of the includes so that Posix_Memalign_Internal.asm comes before POSIX_Shared_Memory.asm.

This is the code from Posix_Memalign_Internal.asm:

mov rax,r15
mov rbx,r12 ; passed in from calling function
mul rbx
mov r14,rax ; total number of buffers (cores x buffers per core)
xor r13,r13 ; buffer index for the loop below

Memalign_Create_Loop:

; Allocate the buffer for the memory pointers
; # of cores times N memory pointers needed (passed in from caller)
; int posix_memalign(void **memptr, size_t alignment, size_t size);

mov rax,r15 ; number of cores
mov rbx,r12 ; number of buffers needed
 ; N ptrs per core x 8 bytes per pointer
mul rbx ; times the number of cores
mov r12,rax ; number of buffers needed x number of cores
lea rdi,[memalign_pointer]
mov rsi,64 ; alignment
mov rdx,r12
shl rdx,3
mov rdx,128 ; buffer size
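; note: as in the question, this overwrites the size just computed
; into rdx, so the two instructions above are dead code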
sub rsp,40
call posix_memalign wrt ..plt
add rsp,40
lea r8,[internal_buffer_pointers]
lea rdi,[memalign_pointer]
mov rax,[rdi]
mov rbx,r13
shl rbx,3
mov [r8+rbx],rax
add r13,1
cmp r13,r14
jl Memalign_Create_Loop

This is the code from Posix_Shared_Memory.asm:

mov rax,[Number_Of_Cores_Seq]
mul rbx
mov [shm_buffers_count],rax
shl rax,3
mov [shm_buffers_bytes],rax
mov r12,rax ; length of pointer buffer

; Buffer for shm pointers
mov rdi,r12
sub rsp,40
call [rel malloc wrt ..got]
mov qword [shm_buffers_ptr],rax
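; note: rax is stored without a NULL check here
; (likewise for the second malloc call below)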
add rsp,40

; Buffer for shm fds
mov rdi,r12
sub rsp,40
call [rel malloc wrt ..got]
mov qword [shm_fds_ptr],rax
add rsp,40

; _________
; Allocate shared memory

xor r12,r12 ; buffer # (0, 8, 16, 24)
xor r13,r13 ; core# (0, 1, 2, 3, ...)
xor r14,r14 ; counter for shm_buffer

Get_ShmName:
lea rdi,[shm_base_name]
mov rsi,r12
mov rdx,r13
call [rel get_shm_name wrt ..got]
mov rdi,rax
push rdi ; shm_name ptr
call [rel unlink_shm wrt ..got]
pop rdi

; Create the shm buffer
mov rsi,[shm_size]
call [rel open_shm wrt ..got]
get_buffer:
mov rdi,rax
mov rax,[rdi]
get_buffer_data:
mov rbp,[shm_fds_ptr]
mov [rbp+r14],rax ; shm_fds
mov rbp,[shm_buffers_ptr]
mov rax,[rdi+8]
mov [rbp+r14],rax ; shm_ptrs
add r12,8
add r14,8
cmp r12,[shm_buffers_bytes]
jl Get_ShmName
next_core:
xor r12,r12
add r13,1
cmp r13,[Number_Of_Cores_Seq]
jl Get_ShmName

Posix_Memalign_Internal.asm uses posix_memalign to create four small buffers of 128 bytes each. POSIX_Shared_Memory.asm creates two 4 MB shared-memory buffers.

My conclusion from this is that when posix_memalign and POSIX shared memory are used in the same program, posix_memalign should be called before the POSIX shared memory buffers are created. In the final executable the two sequences appear one after the other, with three instructions in between.

I don't see anything in the code blocks that explains why this is so, but it fixed this problem here.
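One low-effort way to test the heap-corruption theory is glibc's MALLOC_CHECK_ environment variable, which makes the allocator validate the heap on every call and abort closer to the offending write (host_app again stands in for whatever program loads NWL.so):

MALLOC_CHECK_=3 ./host_app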

As a final note, it failed today on a Skylake, so the CPU family is not relevant, as Nate Eldredge said in his comment above.

Thanks to all who responded.

RTC222
  • *posix_memalign should be called before posix shared memory buffers are created.* No, there's no reason to expect such interference. Your code is probably just buggy. Perhaps calling one first puts their addresses in a different relative order, which makes a memory corruption bug harmless? – Peter Cordes Oct 01 '20 at 22:26
  • Also, I seriously don't get why you're writing this part by hand in asm, especially if you're using slower-than-necessary instructions like `mul` instead of `imul r12, r15`. Just write it in C and let a compiler optimize it to smaller and/or faster code than you're writing, especially given compile-time constants like you're doing with `mov r15,1` before an include. Compilers would optimize away all the calculation to just constants. (Or you could, too, with assemble-time expressions, but then it wouldn't work for runtime variables. Compilers can handle either way efficiently.) – Peter Cordes Oct 01 '20 at 22:28
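As a sketch of the imul suggestion above, the four-instruction mul sequence at the top of Memalign_Create_Loop could collapse to a single instruction that also leaves rax and rdx untouched:

imul r12,r15 ; r12 = number of buffers needed x number of cores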