
This question relates to gcc (4.6.3 Ubuntu) and its behavior in unrolling loops for SSE intrinsics with immediate operands.

An example of an intrinsic with immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer which can only be a constant. However, using the -O3 option, the compiler apparently automatically unrolls loops (if the loop counter values can be determined at compile time) and produces multiple instances of the corresponding blend instruction with different immediate values.

This is a simple test code (blendsimple.c) which runs through the 16 possible values of the immediate operand of blend:

#include <stdio.h>
#include <x86intrin.h>

#define PRINT(V)                \
  printf("%s: ", #V);               \
  for (i = 3; i >= 0; i--) printf("%3g ", V[i]);    \
  printf("\n");

int
main()
{
  __m128 a = _mm_set_ps(1, 2, 3, 4);
  __m128 b = _mm_set_ps(5, 6, 7, 8);
  int i;
  PRINT(a);
  PRINT(b);
  unsigned mask;
  __m128 r;
  for (mask = 0; mask < 16; mask++) {
    r = _mm_blend_ps(a, b, mask);
    PRINT(r);
  }
  return 0;
}

It is possible to compile this code with

gcc -Wall -march=native -O3 -o blendsimple blendsimple.c

and the code works. Obviously the compiler unrolls the loop and inserts constants for the immediate operand.

However, if you compile the code with

gcc -Wall -march=native -O2 -o blendsimple blendsimple.c

you get the following error for the blend intrinsic:

error: the last argument must be a 4-bit immediate

Now I tried to find out which specific compiler flag is active in -O3 but not in -O2 which allows the compiler to unroll the loop, but failed. Following the gcc online docs at

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html

I executed the following commands:

gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled

which lists all options enabled by -O3 but not by -O2. When I add all of the 7 listed flags in addition to -O2

gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c

I would expect that the behavior is exactly the same as with -O3. However, the compiler complains that "the last argument must be a 4-bit immediate".

Does anyone have an idea what the problem is? I think it would be good to know which flag is required to enable this type of loop unrolling so that it can be activated selectively using #pragma GCC optimize or by a function attribute.

(I was also surprised that -O3 obviously doesn't even enable the unroll-loops option).

I would be grateful for any help. This is for a lecture on SSE programming I give.

Edit: Thanks a lot for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3, 4.8.2), and 4.8.2 complains about the immediate operand regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this also fails in 4.8.2. So apparently one cannot trust this feature and should always unroll "manually" using cpp or templates, as Jason R pointed out.
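A minimal sketch of the "manual" unrolling with cpp mentioned above, so that every call to _mm_blend_ps sees a literal constant. The names REPEAT_16, BLEND_STEP and blend_all are hypothetical and chosen only for illustration. The target attribute is an assumption that lets the function build without -msse4.1 on newer gcc or clang; with -march=native as in the question it should not be needed.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_blend_ps */

/* Hypothetical helper: expands M(0) ... M(15), one expansion per immediate. */
#define REPEAT_16(M) \
  M(0)  M(1)  M(2)  M(3)  M(4)  M(5)  M(6)  M(7) \
  M(8)  M(9)  M(10) M(11) M(12) M(13) M(14) M(15)

/* One unrolled step: N is a literal here, so the immediate is a constant. */
#define BLEND_STEP(N) _mm_storeu_ps(out[N], _mm_blend_ps(a, b, N));

/* Hypothetical function: stores the blend result for every mask 0..15.
   Assumption: the target attribute enables SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
static void blend_all(__m128 a, __m128 b, float out[16][4])
{
  REPEAT_16(BLEND_STEP)
}
```

Calling blend_all(a, b, out) with the a and b from the test program fills out[mask] with the four floats of each blend, which can then be printed in an ordinary loop over mask. No optimization flag is involved, so the result should not depend on -O2 versus -O3.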

Ralf
  • I get the error `the last argument must be a 4-bit immediate` even with -O3. – Z boson Jul 18 '14 at 12:24
  • You could always implement the unrolling manually using either preprocessor trickery or template metaprogramming (if you're writing in C++). – Jason R Jul 18 '14 at 15:44
  • This behavior looks more like a compiler bug (which is fixed in 4.8). Compilers are not supposed to give errors that depend on the optimization level. gcc should either always support non-immediates (e.g. via conditionals) or never. It seems they chose the latter in later versions, which makes sense: intrinsics are supposed to be very thin wrappers around machine instructions. – jtaylor Jul 18 '14 at 20:17
  • My policy of "DTTC = don't trust the compiler" is usually the right answer. Since you know you need to unroll the loop, just unroll it. – BitBank Dec 31 '14 at 09:53

1 Answer

I am not sure whether this applies to your situation, since I am not familiar with SSE intrinsics. But generally, you can tell the compiler to apply a specific optimization to a section of code with:

 #pragma GCC push_options
 #pragma GCC optimize ("unroll-loops")

 /* function definitions to be optimized go here */

 #pragma GCC pop_options

Source: Tell gcc to specifically unroll a loop
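As a rough sketch of how the pragma and the function attribute mentioned in the question could be applied to whole functions: sum_pragma and sum_attr are hypothetical names, and the loop is deliberately a plain one without intrinsics. Whether gcc actually unrolls anything here is up to the compiler, and according to the edit in the question this does not make a non-constant immediate acceptable to gcc 4.8.2.

```c
#include <stddef.h>

/* Variant 1: the pragma is placed at file scope and applies to the
   function definitions between push_options and pop_options. */
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
static float sum_pragma(const float *v, size_t n)
{
  float s = 0.0f;
  size_t i;
  for (i = 0; i < n; i++)
    s += v[i];
  return s;
}
#pragma GCC pop_options

/* Variant 2: the same request expressed as a per-function attribute. */
__attribute__((optimize("unroll-loops")))
static float sum_attr(const float *v, size_t n)
{
  float s = 0.0f;
  size_t i;
  for (i = 0; i < n; i++)
    s += v[i];
  return s;
}
```

Both variants only request the optimization for the functions they cover; the rest of the file keeps the options given on the command line.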

pAndrei