I just tested a small example to check whether __restrict__ works in C++ on the latest compilers:

void foo(int x,int* __restrict__ ptr1, int& v2) {
   for(int i=0;i<x;i++) {
       if(*ptr1==v2) {
           ++ptr1;
       } else {
           *ptr1=*ptr1+1;
       }
   }
}

When trying it on godbolt.org with the latest gcc (gcc8.1 -O3 -std=c++14), the __restrict__ works as expected: v2 is loaded only once, since it cannot alias with ptr1.

Here are the relevant assembly parts:

.L5:
  mov eax, DWORD PTR [rsi]
  cmp eax, ecx # <-- ecx contains v2, no load from memory
  jne .L3
  add edx, 1
  add rsi, 4
  cmp edi, edx
  jne .L5

Now the same with the latest clang (clang 6.0.0 -O3 -std=c++14). It unrolls the loop once, so the generated code is much bigger, but here is the gist:

.LBB0_3: # =>This Inner Loop Header: Depth=1
  mov edi, dword ptr [rsi]
  cmp edi, dword ptr [rdx] # <-- restrict didn't work, v2 loaded from memory in hot loop
  jne .LBB0_9
  add rsi, 4
  mov edi, dword ptr [rsi]
  cmp edi, dword ptr [rdx] # <-- restrict didn't work, v2 loaded from memory in hot loop
  je .LBB0_12

Why is this the case? I know that __restrict__ is non-standard and the compiler is free to ignore it, but it seems to be a very fundamental technique for getting the last bit of performance out of one's code, so I doubt that clang would accept the keyword and then silently ignore it. So, what is the issue here? Am I doing anything wrong?

gexicide
  • Probably worth asking this on the Clang Dev mailing list? – Mats Petersson May 16 '18 at 07:54
  • Maybe because `__restrict__` is never defined in the C++ standard and is just a GCC extension? – Serge Ballesta May 16 '18 at 08:12
  • And not only is it a GCC extension, it's an extension which can safely be ignored. It only affects efficiency, not correctness. – MSalters May 16 '18 at 08:51
  • If *"it seems to be a very fundamental technique for getting the last bit of performance out of one's code"*, you should also profile the two snippets to verify the actual impact on performance. – Bob__ May 16 '18 at 09:02
  • @Bob__: I did. Not for this example, but for examples from our real code. This is why I am playing around with `__restrict__` in the first place. I am not getting paid for premature optimizations ;). We have very tight hot loops and the extra memory load costs us measurable performance. – gexicide May 16 '18 at 09:29
  • @MSalters: I know. Of course it can be ignored. But it is still very valuable. Just because something *can* be ignored does not mean modern compilers do so. clang is usually ahead of most compilers when it comes to optimization potential. I would just find it strange if they had ignored the potential in this case. – gexicide May 16 '18 at 09:30
  • Clang 7 User's Manual does not mention `__restrict__` (nor `__restrict`) at all: https://clang.llvm.org/docs/UsersManual.html. Some info can be found here: [Restrict-qualified pointers in LLVM](https://llvm.org/devmtg/2017-02-04/Restrict-Qualified-Pointers-in-LLVM.pdf). – Daniel Langr May 16 '18 at 09:50
  • @gexicide: But you don't _need_ the keyword. `int local_v2 = v2;` allows the same optimization using Standard C++. – MSalters May 16 '18 at 09:51
  • @MSalters Or, passing `v2` by value. – Daniel Langr May 16 '18 at 09:58
  • @MSalters: Of course it does. In this minimal example. But what if you are in a lambda that is passed to a hot loop and `v2` is captured? Then you cannot put it onto the stack before the loop, as you (i.e., the lambda) don't control the loop. – gexicide May 16 '18 at 10:12
  • Well, you might consider expanding the example to a less minimal snippet. While losing generality (of the question itself), you may find at least a workaround for your actual problem, I think. – Bob__ May 16 '18 at 11:27
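
The workaround suggested in the comments (`int local_v2 = v2;`) could be sketched roughly as follows. The function name `foo_local` is made up for illustration, and the loop body is copied from the question. The sketch assumes, as the `__restrict__` version already promises, that `v2` does not refer to any element reachable through `ptr1`; if it did, this version would behave differently from a plain, unqualified `foo`.

```cpp
// Hypothetical variant of the question's foo: no __restrict__, but v2 is
// copied into a local before the loop. A local whose address is never taken
// cannot be modified by stores through ptr1, so it can live in a register.
void foo_local(int x, int* ptr1, int& v2) {
    const int local_v2 = v2;  // v2 is read from memory once, here
    for (int i = 0; i < x; i++) {
        if (*ptr1 == local_v2) {
            ++ptr1;             // element already equals v2: move to the next one
        } else {
            *ptr1 = *ptr1 + 1;  // otherwise increment the current element
        }
    }
}
```

As noted in the comments, this only helps when the code that reads `v2` also controls the loop; it does not cover the lambda case.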

1 Answer

So many useless comments...

This seems to be a bug in Clang's alias analysis. If you change the type of v2 to short, the compiler happily hoists its load out of the loop based on type-based aliasing rules:

for.body:                                         ; preds = %for.inc, %for.body.lr.ph
  %i.09 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
  %ptr1.addr.08 = phi i32* [ %ptr1, %for.body.lr.ph ], [ %ptr1.addr.1, %for.inc ]
  %1 = load i32, i32* %ptr1.addr.08, align 4, !tbaa !5
  %cmp1 = icmp eq i32 %1, %conv
  br i1 %cmp1, label %if.then, label %if.else
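
For reference, the source that yields IR like the above is presumably the question's function with only the type of `v2` changed. The name `foo_short` and the exact signature are assumptions, since only the IR is shown here.

```cpp
// Assumed variant of the question's foo: identical except that v2 is a short&.
// Under type-based aliasing rules, a store through an int* cannot modify a
// short object, so the compiler may load v2 once before the loop (the
// sign-extended value would correspond to %conv in the IR).
void foo_short(int x, int* __restrict__ ptr1, short& v2) {
    for (int i = 0; i < x; i++) {
        if (*ptr1 == v2) {
            ++ptr1;
        } else {
            *ptr1 = *ptr1 + 1;
        }
    }
}
```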

With the original loop, however, both memory references end up in the same alias set, which is why the middle-end can't optimize it:

  %i.08 = phi i32 [ %inc, %for.inc ], [ 0, %for.body.preheader ]
  %ptr1.addr.07 = phi i32* [ %ptr1.addr.1, %for.inc ], [ %ptr1, %for.body.preheader ]
  %0 = load i32, i32* %ptr1.addr.07, align 4, !tbaa !1
  %1 = load i32, i32* %v2, align 4, !tbaa !1
  %cmp1 = icmp eq i32 %0, %1
  br i1 %cmp1, label %if.then, label %if.else

Note the !tbaa !1 attached to both memory references, which means that the compiler could not distinguish the memory accessed by either of them. It seems that the restrict annotation was lost along the way...

I encourage you to reproduce this with the latest Clang and file a bug in the LLVM Bugzilla (be sure to cc Hal Finkel).

yugr