This question relates to gcc (4.6.3 Ubuntu) and its behavior in unrolling loops for SSE intrinsics with immediate operands.
An example of an intrinsic with an immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer, which must be a compile-time constant. However, with the -O3 option, the compiler apparently unrolls loops automatically (if the loop counter values can be determined at compile time) and produces multiple instances of the corresponding blend instruction with different immediate values.
This is a simple test code (blendsimple.c) which runs through the 16 possible values of the immediate operand of blend:
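    #include <stdio.h>
    #include <x86intrin.h>

    /* Print the four floats of vector V, highest element first.
       Relies on the loop variable i declared in the enclosing scope. */
    #define PRINT(V) \
        do { \
            printf("%s: ", #V); \
            for (i = 3; i >= 0; i--) printf("%3g ", V[i]); \
            printf("\n"); \
        } while (0)

    int
    main()
    {
        __m128 a = _mm_set_ps(1, 2, 3, 4);
        __m128 b = _mm_set_ps(5, 6, 7, 8);
        int i;
        PRINT(a);
        PRINT(b);
        unsigned mask;
        __m128 r;
        /* mask is a loop variable, not a literal constant */
        for (mask = 0; mask < 16; mask++) {
            r = _mm_blend_ps(a, b, mask);
            PRINT(r);
        }
        return 0;
    }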
It is possible to compile this code with
gcc -Wall -march=native -O3 -o blendsimple blendsimple.c
and the code works. Evidently the compiler unrolls the loop and inserts constants for the immediate operand.
However, if you compile the code with
gcc -Wall -march=native -O2 -o blendsimple blendsimple.c
you get the following error for the blend intrinsic:
error: the last argument must be a 4-bit immediate
I tried to find out which specific compiler flag is active in -O3 but not in -O2 and allows the compiler to unroll the loop, but I failed. Following the gcc online docs at
https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html
I executed the following commands:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
which lists all options enabled by -O3 but not by -O2. When I add all 7 listed flags to -O2
gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c
I would expect the behavior to be exactly the same as with -O3. However, the compiler still complains that "the last argument must be a 4-bit immediate".
Does anyone have an idea what the problem is? I think it would be good to know which flag is required to enable this type of loop unrolling so that it can be activated selectively using #pragma GCC optimize or by a function attribute.
(I was also surprised that -O3 apparently doesn't even enable the -funroll-loops option.)
I would be grateful for any help. This is for a lecture on SSE programming I give.
Edit: Thanks a lot for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3, 4.8.2), and 4.8.2 complains about the immediate problem regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this also fails in 4.8.2. So apparently one cannot trust this feature and should always unroll "manually" using cpp or templates, as Jason R pointed out.
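For reference, here is a minimal sketch of what "manual" unrolling with cpp could look like. The helper name blend_ps_dyn and the macro BLEND_CASE are my own placeholders, not part of any library. Every call to _mm_blend_ps receives a literal constant, so the result should not depend on the optimization level. The target attribute is an assumption for newer compilers; with gcc 4.6.3 the file still has to be compiled with -msse4.1 or -march=native.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* One switch case per mask value; M is always a literal here,
   so the immediate operand is a true compile-time constant. */
#define BLEND_CASE(M) case M: return _mm_blend_ps(a, b, M);

/* blend_ps_dyn is a hypothetical helper: it selects the blend
   variant at run time from the low 4 bits of mask. */
__attribute__((target("sse4.1")))
static __m128 blend_ps_dyn(__m128 a, __m128 b, unsigned mask)
{
    switch (mask & 0xF) {
    BLEND_CASE(0)  BLEND_CASE(1)  BLEND_CASE(2)  BLEND_CASE(3)
    BLEND_CASE(4)  BLEND_CASE(5)  BLEND_CASE(6)  BLEND_CASE(7)
    BLEND_CASE(8)  BLEND_CASE(9)  BLEND_CASE(10) BLEND_CASE(11)
    BLEND_CASE(12) BLEND_CASE(13) BLEND_CASE(14) BLEND_CASE(15)
    default: return a;  /* not reachable: mask & 0xF is in 0..15 */
    }
}
```

In the test program above, the loop body would then call blend_ps_dyn(a, b, mask) instead of _mm_blend_ps(a, b, mask). The price is a run-time branch per call, so this is only meant for cases where the mask really is not known at compile time.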