I'm trying to manually benchmark the time taken to read the elements of an array, so I wrote an assembly routine that reads from the array passed to it.
Sample assembly code:

; void array_read(long n, double *A);
; n must be divisible by 64

section .text

%ifidn __OUTPUT_FORMAT__, macho64
%define array_read _array_read
%endif

global array_read

array_read:
        ;   rdi = n
        ;   rsi = A
        ;   rax = i
        push rbx
        mov rax, 0
        align 16
.main_loop:
        vmovaps ymm0,  [rsi+8*(rax+ 0)]
        vmovaps ymm1,  [rsi+8*(rax+ 4)]
        vmovaps ymm2,  [rsi+8*(rax+ 8)]
        vmovaps ymm3,  [rsi+8*(rax+12)]
        add rax, 64
        cmp rax, rdi
        jl .main_loop
.epilog:
        ; Restore caller registers
        pop rbx
        ret

On macOS (Intel) I'm lucky: the array happens to be aligned correctly and the routine produces the expected output. On Linux, however, it gives:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fe37cdee20f in ???
#1  0x557940d91600 in ???
Segmentation fault (core dumped)

Why do I get a segmentation fault?

  • Are you asking how to allocate an aligned buffer in the caller in the first place? If so, is it even written in asm? If not, [How to solve the 32-byte-alignment issue for AVX load/store operations?](https://stackoverflow.com/q/32612190). Or are you asking about implementation strategies for handling the first maybe-unaligned vector of an array and then doing aligned loads? (Instead of just using `vmovups` to let hardware handle the minor performance cost of cache-line splits if the caller passes a misaligned buffer) – Peter Cordes Oct 16 '21 at 06:44
  • I think you need to pass aligned data when you call your assembly routine. That is better than trying to align it inside the routine. – Afshin Oct 16 '21 at 06:44
  • Also, what's the point of push/pop of RBX here? Your loop doesn't need any call-preserved registers. It also doesn't need an index; it could just increment the pointer. – Peter Cordes Oct 16 '21 at 06:45
  • Array `A` is filled with random values by passing it to the `random_number` function in Fortran. – Addiction99 Oct 16 '21 at 07:06
  • I wanted to know why I get a `Segmentation fault` on Linux but not on macOS. – Addiction99 Oct 16 '21 at 07:07
  • Yeah, I know what push and pop do, but *why* would you save/restore RBX? Your function doesn't use it, or any of the other call-preserved registers, so the cheapest way to follow the calling convention is not to touch it at all. – Peter Cordes Oct 16 '21 at 07:28
  • Why do you get a segfault? Because you passed an unaligned pointer to a function that used `vmovaps` which requires 32-byte alignment. Read the manual for `vmovaps` (https://www.felixcloutier.com/x86/movaps) vs. `vmovups`. If you want to know about Fortran memory-alignment rules, and how to ask for more alignment, ask that instead. In the x86-64 System V ABI, `alignof(max_align_t)` is 16, so things like malloc are only guaranteed to return memory with that much alignment. (Also for global arrays). – Peter Cordes Oct 16 '21 at 07:31
  • If you mean why Linux happens to return memory not aligned by 32, unfortunately glibc malloc, when it allocates a whole page or more, typically keeps the first 16B of it for bookkeeping, so the returned pointer is an odd multiple of 16. But if your array is on the stack or global, then that doesn't apply so it's just chance / coincidence. You didn't show any of the code in your caller. – Peter Cordes Oct 16 '21 at 07:32
  • There is a [Why are some arrays not properly aligned for vectorization in Fortran?](https://stackoverflow.com/q/68531942) question on SO, but it's unanswered. – Peter Cordes Oct 16 '21 at 07:34
  • Thank you @PeterCordes, [YASM: vmovaps instruction causing segmentation fault](https://stackoverflow.com/questions/54544056/yasm-vmovaps-instruction-causing-segmentation-fault) was really helpful... – Addiction99 Oct 16 '21 at 09:44
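
To confirm the diagnosis in the comments from the calling side, one option is to test the pointer before handing it to the assembly routine. The helper below is a minimal sketch; `is_aligned` is a hypothetical name, not something from the question or from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from the question): returns true when p sits on an
// `alignment`-byte boundary. `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // For a power of two, the low bits of an aligned address are all zero.
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Calling `is_aligned(A, 32)` just before `array_read(n, A)` should show whether the buffer meets the 32-byte requirement of `vmovaps`. A pointer from `malloc` is only guaranteed to pass `is_aligned(p, alignof(std::max_align_t))`, which is 16 on x86-64 System V.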

1 Answer

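In C++, declare the buffer with 32-byte alignment and pass the element count (not the size in bytes) as `n`:

extern "C" void array_read(long n, double *A);

alignas(32) double arr[64];  // n must be divisible by 64
array_read(64, arr);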
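
For arrays too large for the stack, the same alignment can be requested from the heap. The following is a minimal sketch assuming C++17 and a platform that provides `std::aligned_alloc` (glibc and recent macOS do); `make_aligned_doubles`, `AlignedDoubles` and `FreeDeleter` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Deleter so std::unique_ptr releases the buffer with std::free,
// which is the matching deallocator for std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Hypothetical helper: allocate n doubles on a 32-byte boundary,
// suitable for aligned AVX loads such as vmovaps ymm, [mem].
inline AlignedDoubles make_aligned_doubles(std::size_t n) {
    constexpr std::size_t kAlign = 32;
    // std::aligned_alloc wants the size to be a multiple of the alignment,
    // so round the byte count up (and never request zero bytes).
    std::size_t bytes = (n * sizeof(double) + kAlign - 1) / kAlign * kAlign;
    if (bytes == 0) bytes = kAlign;
    void* p = std::aligned_alloc(kAlign, bytes);
    assert(p != nullptr);  // a real benchmark would handle failure properly
    return AlignedDoubles(static_cast<double*>(p));
}
```

With this helper, the call would look like `auto buf = make_aligned_doubles(64); array_read(64, buf.get());`. The buffer is uninitialized, so fill it before timing if the values matter.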
Afshin