Does gcc (latest versions: 4.8, 4.9) have an "assume" clause similar to __assume()
built-in supported by icc?
E.g., __assume( n % 8 == 0 );

- See: `__builtin_expect`? https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – Paul R Sep 04 '14 at 14:29
- Looks like it is for branch prediction; I need to hint the vectorizer that the loop count is a good number. – user2052436 Sep 04 '14 at 14:36
- I don't have access to icc; is it the same as Visual C __assume()? (http://msdn.microsoft.com/en-us/library/1b3fsfxw.aspx) – Remo.D Sep 04 '14 at 15:06
- From [here](http://en.chys.info/2010/07/counterpart-of-assume-in-gcc/): `#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)`. I haven't tested whether GCC uses this for optimization, though. – mafso Sep 04 '14 at 16:37
3 Answers
As of gcc 4.8.2, there is no equivalent of __assume() in gcc. I don't know why -- it would be very useful. mafso suggested:
#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
This is an old trick, known at least as far back as 2010 and probably earlier. The compiler usually optimizes away the evaluation of 'cond', because any execution in which cond is false would have undefined behavior anyway. However, it does not seem to optimize away 'cond' if it contains a call to an opaque (non-inlined) function. The compiler must assume that the opaque call might have a side effect (e.g., changing a global), so it cannot remove the call, although it could remove any computations and branches on the result. For this reason, the macro approach is a partial solution at best.
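To make the macro above concrete, here is a minimal sketch with a transparent, side-effect-free condition, which is the case where the trick is reported to work well. The names `ASSUME` and `scale` are made up for this illustration (identifiers beginning with a double underscore, like `__assume`, are reserved for the implementation), and whether the vectorizer actually drops its remainder loop depends on the GCC version, so check the generated assembly.

```c
#include <stddef.h>

/* Hypothetical macro name; same body as the __assume macro quoted above. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

/* Scales the first n floats of v in place.
   The caller promises n % 8 == 0; breaking that promise is undefined behavior. */
void scale(float *v, size_t n, float k)
{
    ASSUME(n % 8 == 0);   /* transparent condition: no function calls, no side effects */
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}
```

Calling `scale(v, 8, 2.0f)` is fine; calling it with n = 10 is not merely slow but undefined, since the hint is a promise rather than a check.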

- Came across this issue as well in gcc 5.2 and 6.1. cond is not optimized away when the underlying expression is opaque, even if cond is wrapped in a pure function, which you'd think the compiler would be free to optimize away. Also, I have not yet found a way to detect whether the compiler optimizes away cond, which means it is possible to hurt performance with this macro by having the compiler silently add unnecessary code. With a transparent expression the macro works very well, though. – user377178 Aug 03 '16 at 20:25
- @user377178 "Even if cond is wrapped in pure function" - this would be a serious perf issue, please [report](https://gcc.gnu.org/bugzilla/) if it still reproduces. – yugr Dec 21 '19 at 21:46
- This does not work to assume a number is a multiple of another: https://gcc.godbolt.org/z/oWsPo1q6z The first one works but the second one does nothing. Replacing the second one with the proposed solution or b = (b/16)*16 works, but those add instructions... – MappaM Nov 11 '22 at 13:03
- @MappaM it works as of gcc 12.1, i.e., the compiler correctly optimizes on modulo returning 0. The optimization is even more impressive in gcc 13.1, where the entire loop is unrolled to almost nothing. Gcc 13.1 also has the `[[assume(cond)]]` attribute, which is the standard way of getting assumptions as of C++23. – Pablo Halpern Apr 28 '23 at 15:24
[[assume(...)]];
- a portable version added to C++23.
__attribute__((__assume__(...)));
- added in GCC 13 alongside the above, useful for C code.
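For C code that has to build with both old and new GCC, one possible arrangement is to pick the attribute when it is available and fall back to the `__builtin_unreachable()` trick otherwise. This is only a sketch: `ASSUME` and `sum8` are hypothetical names, and the version gate is my assumption about how to detect GCC 13 without also matching Clang or ICC, which define `__GNUC__` too.

```c
/* ASSUME is a hypothetical macro name, not a GCC or ICC built-in. */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER) \
    && __GNUC__ >= 13
/* GCC 13 and later: the assume attribute applied to a null statement. */
#  define ASSUME(cond) __attribute__((__assume__(cond)))
#else
/* Older GCC and compatible compilers: the __builtin_unreachable() trick. */
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

/* Sums the first n ints of v; the caller promises n is a multiple of 8. */
int sum8(const int *v, int n)
{
    ASSUME(n % 8 == 0);
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

Both branches expand to something that needs a trailing semicolon, so call sites look the same either way.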

In your example you want to inform the compiler that N is a multiple of 8. You can do this simply by inserting the line
N = N & 0xFFFFFFF8;
in your code (if N is a 32-bit integer). This doesn't change N, because N is a multiple of 8, but since GCC 4.9 the compiler seems to understand that N is a multiple of 8 after this line. This is shown by the next example, in which two float vectors are added:
int add_a(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    N = N & 0xFFFFFFF8;
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}

int add_b(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}
With gcc -m64 -std=c99 -O3, gcc version 4.9, add_a compiles to the vectorized code
add_a:
and edx, -8
jle .L6
sub edx, 4
xor ecx, ecx
shr edx, 2
lea eax, [rdx+1]
xor edx, edx
.L3:
movaps xmm0, XMMWORD PTR [rdi+rdx]
add ecx, 1
addps xmm0, XMMWORD PTR [rsi+rdx]
movaps XMMWORD PTR [rdi+rdx], xmm0
add rdx, 16
cmp ecx, eax
jb .L3
.L6:
xor eax, eax
ret
With function add_b, more than 20 extra instructions are needed to handle the case that N is not a multiple of 8:
add_b:
test edx, edx
jle .L17
lea ecx, [rdx-4]
lea r8d, [rdx-1]
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea eax, [0+rcx*4]
jbe .L16
xor r8d, r8d
xor r9d, r9d
.L11:
movaps xmm0, XMMWORD PTR [rdi+r8]
add r9d, 1
addps xmm0, XMMWORD PTR [rsi+r8]
movaps XMMWORD PTR [rdi+r8], xmm0
add r8, 16
cmp ecx, r9d
ja .L11
cmp eax, edx
je .L17
.L10:
movsx r8, eax
lea rcx, [rdi+r8*4]
movss xmm0, DWORD PTR [rcx]
addss xmm0, DWORD PTR [rsi+r8*4]
movss DWORD PTR [rcx], xmm0
lea ecx, [rax+1]
cmp edx, ecx
jle .L17
movsx rcx, ecx
add eax, 2
lea r8, [rdi+rcx*4]
cmp edx, eax
movss xmm0, DWORD PTR [r8]
addss xmm0, DWORD PTR [rsi+rcx*4]
movss DWORD PTR [r8], xmm0
jle .L17
cdqe
lea rdx, [rdi+rax*4]
movss xmm0, DWORD PTR [rdx]
addss xmm0, DWORD PTR [rsi+rax*4]
movss DWORD PTR [rdx], xmm0
.L17:
xor eax, eax
ret
.L16:
xor eax, eax
jmp .L10
See Godbolt link.
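One caveat with the masking approach in add_a: the mask is real code, not a hint, so if a caller ever passes an N that is not a multiple of 8, the trailing elements are silently skipped instead of being processed. A small sketch to make that visible; `add_masked` and its return value are hypothetical, the mask is written as `~7` so it does not depend on the width of int, and the alignment hints are left out so it is safe to call on any array.

```c
/* Same masking idea as add_a. Returns how many elements were actually
   processed (a change made here only so the effect is observable). */
int add_masked(float *restrict a, const float *restrict b, int n)
{
    n &= ~7;                        /* rounds n DOWN to a multiple of 8 */
    for (int i = 0; i < n; i++)
        a[i] += b[i];
    return n;
}
```

With n = 10 only the first 8 elements are updated, so the mask is only a no-op when the caller really does pass a multiple of 8.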

- This also causes the compiler to insert code to do the bitwise and. – David Stone Oct 05 '19 at 03:43
- Isn't there a better solution without adding instructions? Just hinting the compiler? – MappaM Nov 11 '22 at 13:02
- Yes, there is. You can add an `assume(!(N & 0x7));` line. [Godbolt link](https://gcc.godbolt.org/z/5n6fWTrxx) – fdermishin Nov 22 '22 at 13:48