Is there a 'standard' way to force a C compiler not to skip a 'dummy load' whose only purpose is to pull data into the CPU cache (a load used as a prefetch)?

In assembly it is just a load such as mov eax,[ebx], and the assembler will not drop the instruction even if the value in eax is never used afterwards.

But an optimizing C compiler can remove a load if it sees that the loaded data is not used in any later calculation.

So there are many ugly hacks for C compilers, such as performing some unneeded operation on the pre-loaded data (summing it up, say) and then 'comparing' the result. This is not nice and burdens the CPU with unneeded instructions. Example:

    long long accumulator = 0;
    char *p;  /* points to the data to be prefetched */
    for (int i = 0; i < PREFETCH_SIZE; i += 64)
    {
        accumulator += *(p + i);
    }
    if (accumulator < some_large_dummy_value)
    {
        // do something really useful
    }

Maybe there is a special 'pragma' or some other way to force the C compiler not to skip a 'guaranteed software prefetch' like:

char *p;
for(int i=0; i < PREFETCH_SIZE; i+=64)
{
   char b = *(p+i);
}

I know about _mm_prefetch(), but it gives a weaker guarantee that the data actually reaches the cache: it may be dropped if it causes a TLB miss, may be limited when the memory-operation buffers are overloaded, and so on.
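
For comparison, a hint-based prefetch loop might look like the sketch below. `prefetch_hint_range` and `PREFETCH_HINT` are hypothetical names; `_mm_prefetch` and `_MM_HINT_T0` come from `<xmmintrin.h>` on x86-64, and the fallback `__builtin_prefetch` assumes GCC or Clang. Either form is only a hint, which is exactly the limitation described above.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_HINT(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
#else
/* Assumption: GCC or Clang on a non-x86-64 target. */
#define PREFETCH_HINT(addr) __builtin_prefetch((addr), 0, 3)
#endif

#define CACHE_LINE_SIZE 64   /* assumed line size, as in the question */

/* Hypothetical helper: issue one prefetch hint per cache line.
   Returns the number of hints issued. */
size_t prefetch_hint_range(const char *p, size_t size)
{
    size_t hints = 0;
    assert(p != NULL);
    for (size_t i = 0; i < size; i += CACHE_LINE_SIZE) {
        PREFETCH_HINT(p + i);   /* a hint only: the CPU may drop it */
        hints++;
    }
    return hints;
}
```

Unlike a real load, nothing here forces the line into the cache; the loop only asks for it.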

The Intel optimization manual is full of 'software prefetch' examples, but they are given only in assembly.

This may be closely related to questions about locally disabling compiler optimizations, such as "How to prevent gcc optimizing some statements in C?", but most of the solutions there are compiler-dependent. The 'volatile' qualifier looks promising. But does it work only for memory writes, or for reads too?

EDIT: The solution that finally worked:

#define CACHE_LINE_SIZE 64
void my_SWprefetch(char *p, int iSize)
{
    
    for(int i=0; i < iSize; i+=CACHE_LINE_SIZE)
    {
        (void)*(volatile char *)(p+i);

    }
}
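
One possible way to use this helper is to touch a buffer and then run the real work over it. The sketch below repeats the function so that it compiles on its own; `sum_bytes` and `prefetch_then_sum` are hypothetical names, and whether the pre-touch actually helps depends on the CPU and the access pattern.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumed line size */

/* Same technique as above: one volatile byte read per cache line. */
void my_SWprefetch(char *p, int iSize)
{
    assert(p != NULL);
    for (int i = 0; i < iSize; i += CACHE_LINE_SIZE) {
        (void)*(volatile char *)(p + i);
    }
}

/* Hypothetical workload that runs after the buffer has been touched. */
long sum_bytes(const char *p, int iSize)
{
    long sum = 0;
    for (int i = 0; i < iSize; i++) {
        sum += p[i];
    }
    return sum;
}

/* Touch the buffer first, then do the real work on it. */
long prefetch_then_sum(char *p, int iSize)
{
    my_SWprefetch(p, iSize);
    return sum_bytes(p, iSize);
}
```
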
DTL2020
  • This has rather strong [XY problem](https://xyproblem.info/) vibes. Why do you need this? Is there some motivating example where skipping/not skipping makes a difference? –  Dec 12 '21 at 10:47
  • If the compiler skips this, the whole idea of a 'guaranteed software-initiated prefetch' into the CPU cache is lost. In assembly we do not have this trouble, but C is more widely used now, and I am looking for a way to get the same operation out of an optimizing C compiler. About the XY problem: yes, maybe the question is more general: how to force an optimizing C compiler not to skip any part of the program text, even if it looks like a 'do nothing' operation to the compiler? – DTL2020 Dec 12 '21 at 11:00
  • About examples: the Intel Software Optimization manual is full of examples that fetch data into the cache with such 'load and discard the loaded data' operations. So I am looking for a clean, working C equivalent without homebrewed hacks, if possible. – DTL2020 Dec 12 '21 at 11:07

1 Answer

Cast your pointer to a pointer to a volatile object:

#define PREFETCH_SIZE 1024

int foo(char *x)
{
    char *p = x;
    for(int i=0; i < PREFETCH_SIZE; i+=64)
    {
        (void)*(volatile char *)(p+i);  /* volatile read: must be emitted */
    }
}

int bar(char *x)
{
    char *p = x;
    for(int i=0; i < PREFETCH_SIZE; i+=64)
    {
        (void)*(p+i);  /* plain read, result unused: may be removed */
    }
}

https://godbolt.org/z/W9onvvMdn

foo:
        mov     rax, rdi
        movzx   edx, BYTE PTR [rdi]
        movzx   edx, BYTE PTR [rdi+64]
        movzx   edx, BYTE PTR [rdi+128]
        movzx   edx, BYTE PTR [rdi+192]
        movzx   edx, BYTE PTR [rdi+256]
        movzx   edx, BYTE PTR [rdi+320]
        movzx   edx, BYTE PTR [rdi+384]
        movzx   edx, BYTE PTR [rdi+448]
        movzx   edx, BYTE PTR [rdi+512]
        movzx   edx, BYTE PTR [rdi+576]
        movzx   edx, BYTE PTR [rdi+640]
        movzx   edx, BYTE PTR [rdi+704]
        movzx   edx, BYTE PTR [rdi+768]
        movzx   edx, BYTE PTR [rdi+832]
        movzx   edx, BYTE PTR [rdi+896]
        movzx   eax, BYTE PTR [rax+960]
        ret
bar:
        ret
0___________
  • Oh, that is a bit scary. Can it emit a loop instead of a sequence of instructions? If PREFETCH_SIZE is large enough, the unrolled sequence will overflow and thrash the CPU instruction cache. I tried setting PREFETCH_SIZE to 32768, and it did finally generate a loop around movzx edx, BYTE PTR [rdi]. Thank you. `foo: lea rax, [rdi+32768] .L2: movzx edx, BYTE PTR [rdi] add rdi, 64 cmp rdi, rax jne .L2 ret` – DTL2020 Dec 12 '21 at 11:41
  • @DTL2020 disable loop unrolling for this function – 0___________ Dec 12 '21 at 11:48
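
One way to act on the last comment is a per-loop pragma, sketched below. It assumes GCC 8 or later (`#pragma GCC unroll 1`) or Clang (`#pragma clang loop unroll(disable)`); other compilers fall through with no pragma and may still unroll. `touch_lines` is a hypothetical name, and the returned count is there only to make the sketch easy to check.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumed line size */

/* Hypothetical helper: one volatile byte read per cache line, with
   unrolling of this loop disabled where the compiler supports it.
   Returns the number of lines touched. */
size_t touch_lines(const char *p, size_t size)
{
    size_t lines = 0;
    assert(p != NULL);
#if defined(__clang__)
#pragma clang loop unroll(disable)
#elif defined(__GNUC__)
#pragma GCC unroll 1
#endif
    for (size_t i = 0; i < size; i += CACHE_LINE_SIZE) {
        (void)*(const volatile char *)(p + i);   /* forced load */
        lines++;
    }
    return lines;
}
```

The pragma has to sit immediately before the loop it applies to; it is worth checking the generated assembly to confirm the loop form is kept.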