
The following code is intended to measure the speed of memory reads that stride by more than the 8 elements of a DDR4 burst and across two banks of RAM. It starts by subtracting from the stack pointer to create a region to play in, then reads from the first location and, hopefully, every 32nd 64-bit word after that.

It doesn't get that far. Instead, it segfaults on the first read from the stack. I then wrote something equivalent in C++, and I can see that the compiler emits a call to a function when it allocates memory on the stack.

I have never seen that function call before, and clearly it does not have to be called every time the stack pointer decreases; most functions just subtract from rsp. What am I doing wrong?

.globl main
main:
    mov $5000000, %rcx
    sub $400000, %rsp  # reserve 400000 bytes of stack space to play in
loop:
    mov %rsp, %rdx
    mov $2000, %r9
inner:
    mov (%rdx), %rax   # memory reads skip so each read has to wait
    add $256, %rdx     # skip forward by 32 64-bit words
    sub $1, %r9
    jnz inner

    sub $1, %rcx
    jnz loop
    add $400000, %rsp
    ret
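
As a side check on the access pattern, the loop arithmetic can be written out in C++. This is only a sketch of the numbers used above; the names `last_read_offset` and `max_reads_within` are hypothetical helpers, not part of the original code. It suggests that 2000 reads at a 256-byte stride reach past the 400000 bytes reserved, and that 1563 reads would fit, which matches the `400128` (1563 * 256) loop bound in the compiler output further down.

```cpp
#include <cstddef>
#include <cstdint>

// Values taken from the assembly above.
constexpr std::size_t kStrideBytes = 32 * sizeof(std::uint64_t);  // add $256, %rdx
constexpr std::size_t kReservedBytes = 400000;                    // sub $400000, %rsp
constexpr std::size_t kInnerReads = 2000;                         // mov $2000, %r9

static_assert(kStrideBytes == 256, "32 64-bit words are 256 bytes");

// Byte offset (from the lowered rsp) of the last read in one inner pass.
constexpr std::size_t last_read_offset(std::size_t reads, std::size_t stride) {
    return (reads - 1) * stride;
}

// Largest number of `width`-byte reads, `stride` bytes apart, that stay
// entirely inside a region of `reserved` bytes.
constexpr std::size_t max_reads_within(std::size_t reserved,
                                       std::size_t stride,
                                       std::size_t width) {
    return reserved < width ? 0 : (reserved - width) / stride + 1;
}
```

Here `last_read_offset(kInnerReads, kStrideBytes)` is 511744, beyond the reserved 400000 bytes, while `max_reads_within(kReservedBytes, kStrideBytes, 8)` is 1563.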
    

The C++ code had to be modified to sum the uninitialized array; otherwise the optimizer would conclude that the reads are dead code.

#include <cstdint>

uint64_t sumit() {
    uint64_t a[50000];
    uint64_t sum = 0;
    for (int i = 0; i < 50000; i += 32)
      sum += a[i];
    return sum;
}

This results in the following assembly under msys2:

_Z5sumitv:
.LFB3:
    movl    $400008, %eax
    call    ___chkstk_ms
    subq    %rax, %rsp
    .seh_stackalloc 400008
    .seh_endprologue
    xorl    %edx, %edx
    movq    %rsp, %rax
    leaq    400128(%rsp), %rcx
    .p2align 4,,10
    .p2align 3
.L2:
    addq    (%rax), %rdx
    addq    $256, %rax
    cmpq    %rcx, %rax
    jne .L2
    movq    %rdx, %rax
    addq    $400008, %rsp
    ret
    .seh_endproc

By contrast, the following code generates only a subtraction of 32 bytes from the stack pointer, with no call. Clearly, the call is only necessary for "big" values. Can you explain how this works, and what the size threshold is?

uint64_t sumit(uint64_t a, uint64_t b) {
    uint64_t x = a+b, y = a-b;
    return f(x,y) - 3*(x - y);
}
Peter Cordes
Dov
    `___chkstk_ms` is the key; you're on Windows, which won't grow the stack by more than 1 page at a time, even if RSP is within the 1MiB total size limit. Unlike on Linux; [How is Stack memory allocated when using 'push' or 'sub' x86 instructions?](https://stackoverflow.com/q/46790666) describes the difference: Linux doesn't need stack probes. – Peter Cordes Feb 05 '23 at 01:23
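
The page-at-a-time behaviour described in the comment above can be modelled in a few lines of C++. This is a minimal sketch, not the actual `___chkstk_ms` implementation: it assumes 4 KiB pages, and `probe_offsets` is a hypothetical name. It only lists which distances below the current stack pointer a probe routine would need to touch so that each new page is reached in order before rsp is lowered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 4 KiB pages, as on x86-64 Windows.
constexpr std::size_t kPageSize = 4096;

// Hypothetical model of a stack probe. Returns the distances below the
// current stack pointer that get touched, in order, for an allocation of
// `bytes` bytes. The caller would subtract `bytes` from rsp afterwards.
std::vector<std::size_t> probe_offsets(std::size_t bytes) {
    assert(bytes > 0);
    std::vector<std::size_t> offsets;
    std::size_t touched = 0;
    // Touch one address per page so that no page is skipped on the way down.
    while (bytes - touched > kPageSize) {
        touched += kPageSize;
        offsets.push_back(touched);
    }
    // Final touch at the bottom of the new allocation.
    offsets.push_back(bytes);
    return offsets;
}
```

For the 400008-byte frame in the compiler output, this model touches 98 addresses, starting 4096 bytes below the stack pointer and ending 400008 bytes below it. For the 32-byte frame it touches a single address inside the current page, which is consistent with small frames not needing the call at all.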
