
Non-temporal stores fail every time for me, but if I replace them with regular (temporal) stores, they never fail.

The only way I got non-temporal loads/stores not to fail is by compiling with the -fsanitize=address option, and I don't quite understand why.

I use calloc, so the memory I am accessing should be initialized as far as I know.

This is my minimal reproducible example:

#include <stdint.h>
#include <stdlib.h>

#define oopsie 8192 // Loads and stores: 4096 doesn't fail;

int main() {
    uint8_t* in = calloc(oopsie, 32);
    // Calloc output is aligned to the 4k memory page
    if(in == NULL) {
        return 1;
    }
    //for(int i=0; i != oopsie*32 ;i+=32) {
       // Segmentation fault at iteration 0, before crossing a memory page boundary;
        //__asm__("VMOVNTDQA (%0), %%ymm0":: "r"(&(in[i])):"%ymm0"); 
       // AT&T syntax
       // vmovdqu never fails; VMOVNTDQ does;
    //}

    __asm__("VMOVNTDQA (%0), %%ymm0":: "r"(&(in[0])):"%ymm0"); 
    return 0;
}
  • `calloc` doesn't return 32-byte aligned memory; initialized or not is irrelevant to a load crashing your program. – Peter Cordes Mar 29 '22 at 19:27
  • And you should be using intrinsics instead of inline asm unless you know the machine far better than the compiler does, *and* you have a hot loop you can't get the compiler to make as good asm for as you can write by hand. You forgot [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) in this case. And if you didn't know VMOVNTDQA requires alignment, and does nothing special on write-back memory, it doesn't sound like you know the machine better than the compiler. – Peter Cordes Mar 29 '22 at 19:31
  • @PeterCordes If it is 4096-byte aligned, then it surely is 32-byte aligned, and the crash happens in a 4096-aligned memory region returned from calloc; I verified in gdb that the address modulo 4096 == 0. 4096/32 = 128, so the 32-byte regions will never straddle a memory page. Or do you mean that the physical address, after translation by the MMU, isn't 32-byte aligned? I've never heard of that being the case. – Nieważne Nieważne Mar 29 '22 at 19:40
  • If calloc happened to return an address that was 4096 aligned, your code wouldn't crash. But that's often not the case. You're sure you verified the address was 32-byte aligned at the point of the crash? Because the comment in your code `// Calloc output is aligned to the 4k memory page` is just plain not true in general. (You're right that 4k alignment implies 32 byte alignment, but calloc's 2nd arg is another scale factor, not an alignment.) – Peter Cordes Mar 29 '22 at 19:50
  • Yep, you were right, it is only 16-byte aligned: 0x7ffff7d3c010 - 0x7ffff7d3c000 (I looked up the memory maps with `cat /proc/$(pidof a.out)/maps`). I was kinda skeptical, but I was too dumb and trusted Google's calculator, which didn't show the fractional part when dividing hex values. I used krunner, and when I saw that 0x7ffff7d3c010/32 has a dangling "half", I realized how stupid I am. – Nieważne Nieważne Mar 29 '22 at 20:02
  • Yup, 32-byte aligned means the 2nd hex digit has to be even, like 0x...20 or 40, but not 10 or 30. The low 5 bits of the address have to be zero. Glibc malloc/calloc typically keep the first 16 bytes of a page for their bookkeeping info, and return a pointer that's 16 bytes away from page-aligned (for large allocations), which sucks especially if you're allocating a multiple of the page size. – Peter Cordes Mar 29 '22 at 20:11
  • Yeah, I didn't have to think about alignment this much before. Usually I just let C abstract it for me, or only cared about how the compiler ends up arranging the memory. I don't calculate hex values in my head (even though I am well aware it works the same as binary/decimal, just in base 16). Either way, thanks; I can just use anonymous page allocation via mmap. – Nieważne Nieważne Mar 29 '22 at 20:48
  • BTW, the way I'd check for alignment with a calculator is `addr & 31` or equivalently `addr % 32`, instead of dividing and looking for a fractional quotient. I'd be thinking in terms of integers and bits, not fractions. Anyway yeah, `mmap` definitely gives you page-aligned memory, great if you don't need a pointer you can pass to `free`. Glibc malloc uses mmap / munmap itself for large allocations. Knowing you have page-aligned memory also means you can easily use `madvise(MADV_HUGEPAGE)` on it. – Peter Cordes Mar 29 '22 at 21:04
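
The `addr & 31` check from the comments can be sketched in C roughly as follows. The helper name `misalignment` is made up for illustration and is not from the thread. It takes the address as a plain 64-bit integer, so the value read from gdb or `/proc/.../maps` can be pasted in directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): number of bytes that `addr`
   sits past the previous `alignment` boundary.
   `alignment` must be a non-zero power of two. */
static uint64_t misalignment(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power-of-two alignment this equals addr % alignment. */
    return addr & (alignment - 1);
}
```

With the pointer value reported in the comments, `misalignment(0x7ffff7d3c010, 32)` is 16, while `misalignment(0x7ffff7d3c010, 16)` is 0: the address is 16-byte aligned but not 32-byte aligned, which matches the fault on the 32-byte load.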

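
Putting the comments together, one possible shape for a fixed version is sketched below. It is only a sketch under a few assumptions: it uses C11 `aligned_alloc` rather than the `mmap` route mentioned in the thread, it uses the `_mm256_stream_load_si256` intrinsic (which compiles to VMOVNTDQA) instead of inline asm, and the helper names `alloc_zeroed_blocks` and `blocks_all_zero` are invented for illustration. It also assumes GCC or Clang (for the `target` attribute) and a CPU with AVX2.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_COUNT 8192  /* same count as `oopsie` in the question */
#define BLOCK_SIZE  32    /* bytes in one YMM register */

/* Hypothetical helper: zeroed buffer whose start is 32-byte aligned.
   aligned_alloc requires the size to be a multiple of the alignment,
   which holds here because the size is count * 32. */
static uint8_t *alloc_zeroed_blocks(size_t count)
{
    uint8_t *p = aligned_alloc(BLOCK_SIZE, count * BLOCK_SIZE);
    if (p != NULL)
        memset(p, 0, count * BLOCK_SIZE);  /* aligned_alloc does not zero */
    return p;
}

/* Hypothetical helper: reads every 32-byte block with the VMOVNTDQA
   intrinsic and ORs them together; returns 1 if every byte was zero.
   The target attribute enables AVX2 for this function only, so the
   file builds without -mavx2 (GCC/Clang assumption). */
__attribute__((target("avx2")))
static int blocks_all_zero(const uint8_t *buf, size_t count)
{
    /* The load below faults unless the low 5 address bits are zero. */
    assert(((uintptr_t)buf & (BLOCK_SIZE - 1)) == 0);

    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < count; i++) {
        const __m256i *src = (const __m256i *)(buf + i * BLOCK_SIZE);
        acc = _mm256_or_si256(acc, _mm256_stream_load_si256(src));
    }
    return _mm256_testz_si256(acc, acc);
}
```

A caller would do something like `uint8_t *in = alloc_zeroed_blocks(BLOCK_COUNT);`, check for `NULL`, call `blocks_all_zero(in, BLOCK_COUNT)`, and then `free(in)`. Because the value is actually used, the compiler cannot discard the loads, which also sidesteps the missing memory-operand constraint that the linked inline-asm question covers. As noted in the comments, the non-temporal hint does nothing special on ordinary write-back memory, so this mainly shows the alignment requirement rather than a performance gain.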