The following code is intended to test the speed of reading memory with a stride larger than the 8-element DDR4 burst, spanning two banks of RAM. It starts by subtracting from the stack pointer to create a region to play in, then reads from the first location and, hopefully, every 32nd 64-bit word after that.
It doesn't get that far. Instead, it segfaults on the first read from the stack. I then wrote something equivalent in C++, and I can see that the compiler emits a function call when it allocates the memory on the stack.
I have never seen that function call before, and clearly it does not have to be called every time the stack pointer is decreased: most functions just subtract from rsp. What am I doing wrong?
.globl main
main:
mov $5000000, %rcx
sub $400000, %rsp # reserve 400000 bytes of stack space to play in
loop:
mov %rsp, %rdx
mov $2000, %r9
inner:
mov (%rdx), %rax # memory reads skip so each read has to wait
add $256, %rdx # skip forward by 32 64-bit words
sub $1, %r9
jnz inner
sub $1, %rcx
jnz loop
add $400000, %rsp
ret
The C++ code had to be modified to sum the uninitialized array; otherwise the optimizer would conclude that the reads are dead code.
#include <cstdint>
uint64_t sumit() {
    uint64_t a[50000];
    uint64_t sum = 0;
    for (int i = 0; i < 50000; i += 32)
        sum += a[i];
    return sum;
}
This results in the following assembly under msys2:
_Z5sumitv:
.LFB3:
movl $400008, %eax
call ___chkstk_ms
subq %rax, %rsp
.seh_stackalloc 400008
.seh_endprologue
xorl %edx, %edx
movq %rsp, %rax
leaq 400128(%rsp), %rcx
.p2align 4,,10
.p2align 3
.L2:
addq (%rax), %rdx
addq $256, %rax
cmpq %rcx, %rax
jne .L2
movq %rdx, %rax
addq $400008, %rsp
ret
.seh_endproc
By contrast, the small function below compiles to a plain subtraction of 32 bytes from rsp, with no call. Clearly, the call is only necessary for "big" allocations. Can you explain how this works, and what the threshold is?
uint64_t f(uint64_t, uint64_t); // assumed declaration; f is defined elsewhere
uint64_t sumit(uint64_t a, uint64_t b) {
    uint64_t x = a+b, y = a-b;
    return f(x,y) - 3*(x - y);
}