19

Does gcc (latest versions: 4.8, 4.9) have an "assume" clause similar to the __assume() built-in supported by icc? E.g., __assume( n % 8 == 0 );

manlio
user2052436
  • See: `__builtin_expect`? https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – Paul R Sep 04 '14 at 14:29
  • Looks like it is for branch prediction; I need to hint the vectorizer that the loop count is a good number. – user2052436 Sep 04 '14 at 14:36
  • I don't have access to icc; is it the same as Visual C __assume()? (http://msdn.microsoft.com/en-us/library/1b3fsfxw.aspx) – Remo.D Sep 04 '14 at 15:06
  • From [here](http://en.chys.info/2010/07/counterpart-of-assume-in-gcc/): `#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)`. I haven't tested whether GCC uses this for optimization, though. – mafso Sep 04 '14 at 16:37

3 Answers

23

As of gcc 4.8.2, there is no equivalent of __assume() in gcc. I don't know why -- it would be very useful. mafso suggested:

#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

This is an old trick, known since at least 2010 and probably earlier. The compiler usually optimizes out the evaluation of 'cond', because any execution in which cond is false would be undefined anyway. However, it does not seem to optimize away 'cond' if it contains a call to an opaque (non-inlined) function. The compiler must assume that the opaque call might have a side effect (e.g., changing a global) and cannot remove the call, although it can remove any computations and branches on the result. For this reason, the macro approach is a partial solution at best.
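As a rough illustration of the "transparent condition" case, here is a minimal sketch applied to the question's n % 8 == 0 example. The names ASSUME and add_vectors are hypothetical (ASSUME is used instead of __assume to avoid a reserved identifier). Whether the vectorizer actually drops the remainder loop depends on the GCC version, so the generated assembly should be checked rather than taken for granted.

```c
/* Hypothetical macro name; same trick as the __assume macro quoted above. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

/* Adds b into a, element by element.
   The caller promises that n is a multiple of 8; calling this with any
   other n is undefined behavior, because the unreachable point is reached. */
void add_vectors(float *restrict a, const float *restrict b, int n)
{
    ASSUME(n % 8 == 0);   /* transparent condition: no opaque calls inside */
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```

The condition here has no side effects, so the compiler is free to discard its evaluation entirely. A condition that calls a non-inlined function would still be evaluated at run time, as described above.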

Pablo Halpern
  • Came across this issue as well in gcc 5.2 and 6.1: cond is not optimized away when the underlying expression is opaque, even if cond is wrapped in a pure function, which you'd think the compiler would be free to optimize away. I also have not yet found a way to detect whether the compiler optimizes away cond, which means it is possible to hurt performance with this macro by having the compiler silently add unnecessary code. With a transparent expression the macro works very well, though. – user377178 Aug 03 '16 at 20:25
  • @user377178 "Even if cond is wrapped in pure function" - this would be a serious perf issue, please [report](https://gcc.gnu.org/bugzilla/) it if it still reproduces. – yugr Dec 21 '19 at 21:46
  • This does not work to assume a number is a multiple of another: https://gcc.godbolt.org/z/oWsPo1q6z The first one works but the second one does nothing. Replacing the second one by the proposed solution or b = (b/16)*16 works, but those add instructions... – MappaM Nov 11 '22 at 13:03
  • @MappaM it works as of gcc 12.1, i.e., the compiler correctly optimizes on modulo returning 0. The optimization is even more impressive in gcc 13.1, where the entire loop is unrolled to almost nothing. Gcc 13.1 also has the `[[assume(cond)]]` attribute, which is the standard way of getting assumptions as of C++23. – Pablo Halpern Apr 28 '23 at 15:24
8

[[assume(...)]]; - a portable version added to C++23.

__attribute__((__assume__(...))); - added in GCC 13 alongside the above, useful for C code.
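For C code, a small wrapper could select the GCC 13 attribute when it is available and otherwise fall back to the __builtin_unreachable() trick. This is only a sketch: the names ASSUME_HINT and sum_ints are hypothetical, and the version check deliberately excludes Clang, which also defines __GNUC__ but whose support for this attribute form is not assumed here.

```c
/* Hypothetical wrapper: use the statement attribute on GCC 13 or newer,
   and the __builtin_unreachable() fallback everywhere else. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 13
#  define ASSUME_HINT(cond) __attribute__((__assume__(cond)))
#else
#  define ASSUME_HINT(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

/* Sums n ints. The caller promises that n is a multiple of 8;
   violating that promise is undefined behavior in both variants. */
int sum_ints(const int *v, int n)
{
    ASSUME_HINT(n % 8 == 0);  /* the trailing ';' forms the attributed null statement */
    int total = 0;
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

One difference between the two branches: with the attribute, the condition is not evaluated at run time, so the opaque-call concern raised for the macro approach should not apply there.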

HolyBlackCat
4

In your example you want to inform the compiler that N is a multiple of 8. You can do this simply by inserting the line

N = N & 0xFFFFFFF8;

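in your code (if N is a 32-bit integer). This doesn't change N when N is already a multiple of 8, but since GCC 4.9 the compiler seems to understand that N is a multiple of 8 after this line. Note that, unlike an assumption, the mask silently rounds N down if it is not a multiple of 8, so the trailing elements would be skipped.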

This is shown by the next example, in which two float vectors are added:

int add_a(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    N = N & 0xFFFFFFF8; 
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}


int add_b(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}

With gcc -m64 -std=c99 -O3 (gcc version 4.9), add_a compiles to the following vectorized code:

add_a:
  and edx, -8
  jle .L6
  sub edx, 4
  xor ecx, ecx
  shr edx, 2
  lea eax, [rdx+1]
  xor edx, edx
.L3:
  movaps xmm0, XMMWORD PTR [rdi+rdx]
  add ecx, 1
  addps xmm0, XMMWORD PTR [rsi+rdx]
  movaps XMMWORD PTR [rdi+rdx], xmm0
  add rdx, 16
  cmp ecx, eax
  jb .L3
.L6:
  xor eax, eax
  ret

With function add_b, more than 20 extra instructions are needed to handle the remainder when N is not a multiple of the vector width (4 floats per XMM register here):

add_b:
  test edx, edx
  jle .L17
  lea ecx, [rdx-4]
  lea r8d, [rdx-1]
  shr ecx, 2
  add ecx, 1
  cmp r8d, 2
  lea eax, [0+rcx*4]
  jbe .L16
  xor r8d, r8d
  xor r9d, r9d
.L11:
  movaps xmm0, XMMWORD PTR [rdi+r8]
  add r9d, 1
  addps xmm0, XMMWORD PTR [rsi+r8]
  movaps XMMWORD PTR [rdi+r8], xmm0
  add r8, 16
  cmp ecx, r9d
  ja .L11
  cmp eax, edx
  je .L17
.L10:
  movsx r8, eax
  lea rcx, [rdi+r8*4]
  movss xmm0, DWORD PTR [rcx]
  addss xmm0, DWORD PTR [rsi+r8*4]
  movss DWORD PTR [rcx], xmm0
  lea ecx, [rax+1]
  cmp edx, ecx
  jle .L17
  movsx rcx, ecx
  add eax, 2
  lea r8, [rdi+rcx*4]
  cmp edx, eax
  movss xmm0, DWORD PTR [r8]
  addss xmm0, DWORD PTR [rsi+rcx*4]
  movss DWORD PTR [r8], xmm0
  jle .L17
  cdqe
  lea rdx, [rdi+rax*4]
  movss xmm0, DWORD PTR [rdx]
  addss xmm0, DWORD PTR [rsi+rax*4]
  movss DWORD PTR [rdx], xmm0
.L17:
  xor eax, eax
  ret
.L16:
  xor eax, eax
  jmp .L10

See Godbolt link.

wim