7

I have the following problem (g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4):

When I use _mm256_slli_si256() directly, such as:

__m256i x = _mm256_set1_epi8(0xff);
x = _mm256_slli_si256(x, 3);

the code compiles without problem (g++ -Wall -march=native -O3 -o shifttest shifttest.C).

However, if I wrap it into a function

__m256i doit(__m256i x, const int imm)
{
  return _mm256_slli_si256(x, imm);
}

the compiler complains that

/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h: In function '__m256i doit(__m256i, int)':
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h:651:58: error: the last argument must be an 8-bit immediate
   return (__m256i)__builtin_ia32_pslldqi256 (__A, __N * 8);

regardless of whether the function is used or not.

This can't be a problem with the immediate operand as such, since the function doit() compiles if I use e.g. _mm256_slli_epi32(x, imm) instead, and _mm256_slli_epi32() also requires an immediate operand.

There is a related bug report on

https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=54825

but it is quite old (2012) and relates to gcc 4.8.0, so I thought the patch would have been incorporated into g++ 4.8.4 already.

Is there a workaround for this problem?

Paul R
Ralf
  • Same for `_mm256_alignr_epi8()`, by the way. So no workaround using that one... – Ralf Jul 09 '15 at 12:46
  • And `_mm256_setr_m128i()` which would help with a workaround using 128-bit shifts is missing completely. Oh, and the same problem as described above occurs with `_mm_slli_si128()`, so that doesn't work either. Something about this `__N * 8` seems to confuse the compiler. – Ralf Jul 09 '15 at 13:18
  • There are two versions of the shift intrinsics for each instruction, one with an immediate arg, and the other with the shift count for all elements in the low bits of an `xmm` register. The two versions share an asm mnemonic, but are different. (AVX2 also introduced variable-shift instructions that take the shift count for each element separately, from the corresponding element in the shift-count register. Those instructions have a different asm mnemonic, as well as a different intrinsic function name.) Oops, there's no variable-count shift-whole-reg-by-bytes, nvm. – Peter Cordes Jul 09 '15 at 20:52
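The two shift forms described in the last comment can be sketched as follows. This is only an illustration under stated assumptions: it uses the 128-bit SSE2 intrinsics so that it builds on any x86-64 target without extra flags, and the helper names `shift_lanes_runtime` and `low_lane` are hypothetical. `_mm_sll_epi32` takes its count from a vector register, so the count may be a run-time value; the whole-register byte shift `_mm_slli_si128` has no such register-count counterpart.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: shifts every 32-bit element left by a run-time count.
// The count travels in the low bits of a vector register, so no immediate
// operand is needed.
inline __m128i shift_lanes_runtime(__m128i x, int count)
{
    assert(count >= 0);  // a negative count is not meaningful here
    return _mm_sll_epi32(x, _mm_cvtsi32_si128(count));
}

// Hypothetical helper: returns the lowest 32-bit element for inspection.
inline int low_lane(__m128i x)
{
    return _mm_cvtsi128_si32(x);
}
```

Calling `shift_lanes_runtime(v, n)` with `n` read from user input compiles fine, which is one plausible reason why the element-wise shift wrappers do not trigger the error.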

2 Answers

11

The argument indicating the number of bytes to shift must be a compile-time constant, because it is encoded as an immediate value in the instruction (i.e. it is not loaded from a register; the shift count is part of the instruction encoding). As long as you use it directly, like this:

__m256i x = _mm256_set1_epi8(0xff);
x = _mm256_slli_si256(x, 3);

then the compiler sees the shift value as a compile-time constant, 3. However, when in the context of your wrapping function:

__m256i doit(__m256i x, const int imm)
{
  return _mm256_slli_si256(x, imm);
}

there is no way for the compiler to infer the value of imm at compile time, which is required in order for it to synthesize the shift instruction. The fact that imm is a const int doesn't mean that its value is known at compile time, only that the semantics of the language don't allow it to be modified within the doit() function scope.
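The difference between "const" and "known at compile time" can be shown without any intrinsics. The following is a minimal sketch; the names `Immediate`, `kShiftBytes`, `from_constant` and `from_parameter` are made up for illustration. A non-type template parameter has the same requirement as an immediate operand: it must be a constant expression.

```cpp
#include <cassert>

// A non-type template parameter must be a constant expression, which makes it
// a convenient stand-in for an instruction's immediate operand.
template <int N>
struct Immediate
{
    static_assert(N >= 0 && N <= 255, "must fit in an 8-bit immediate");
    static constexpr int value = N;
};

constexpr int kShiftBytes = 3;  // value is known while compiling

int from_constant()
{
    // OK: kShiftBytes * 8 is evaluated by the compiler.
    return Immediate<kShiftBytes * 8>::value;
}

int from_parameter(const int imm)
{
    // Immediate<imm * 8>::value would be rejected here: imm is read-only
    // inside this function, but its value is supplied by the caller at run time.
    assert(imm >= 0 && imm * 8 <= 255);  // the range can only be checked at run time
    return imm * 8;
}
```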

If the compiler inlines doit() into a caller that passes a constant, it may be able to determine the value of imm statically and therefore compile successfully, but that depends on the optimizer and is not something I would rely on.

If you're using C++, another option would be to make doit() a function template with an argument indicating the shift size, like this:

template <int Shift>
__m256i doit(__m256i x)
{
  return _mm256_slli_si256(x, Shift);
}
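A usage sketch of the template approach, under the assumption that the 128-bit `_mm_slli_si128` behaves the same way with respect to the immediate (the question's comments report the same error for it). It is used here only because SSE2 is available on every x86-64 target; `doit128` and `zero_bytes_after_shift` are hypothetical names.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_slli_si128, _mm_set1_epi8, _mm_storeu_si128

// 128-bit analogue of the templated doit(): the shift count is a template
// argument, so it is always a constant expression at the call to the intrinsic.
template <int Shift>
__m128i doit128(__m128i x)
{
    static_assert(Shift >= 0 && Shift <= 15,
                  "byte shifts of 16 or more would just clear the register");
    return _mm_slli_si128(x, Shift);
}

// Hypothetical helper: shifts an all-ones register and counts how many of
// its 16 bytes became zero.
template <int Shift>
int zero_bytes_after_shift()
{
    std::uint8_t out[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     doit128<Shift>(_mm_set1_epi8(-1)));
    int zeros = 0;
    for (int i = 0; i < 16; ++i)
        zeros += (out[i] == 0);
    return zeros;
}
```

The call site then names the count explicitly, e.g. `doit128<3>(x)`. The trade-off is that the count can no longer be chosen at run time without dispatching over the instantiations yourself.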
Jason R
  • Oh hell, yes, I forgot the inline, you're absolutely right, thank you! It works if I add the inline. But why does it work without inline for other intrinsics? And you seem hesitant to rely on the inline solution: do you think the fact that this works is more a lucky coincidence than a feature? – Ralf Jul 09 '15 at 13:32
  • The `inline` keyword isn't a guarantee that a function will be inlined; it's only a hint to the compiler, so I do hesitate to rely upon it working for your application. If you're not concerned about total portability, many compilers have [special syntax that allows you to force inlining](http://stackoverflow.com/questions/8381293/how-do-i-force-gcc-to-inline-a-function). I'm not sure what other intrinsics you're referring to, but the bit/byte shifting instructions are unusual in that they **require** immediate arguments. Most SSE/AVX instructions do not. – Jason R Jul 09 '15 at 13:34
  • Thanks. Are you sure about the immediate arguments? I thought, ALL integer parameters passed to SSE/AVX functions are immediate arguments (and thus encoded in the opcode itself)? E.g. `__m256i _mm256_slli_epi32 (__m256i a, int imm8)` from the Intel Intrinsics Reference. – Ralf Jul 09 '15 at 13:55
  • @Ralf: Yes, I'll walk that statement back a bit. In many cases the intrinsics that can take a variable will take it via a `__m256` type. If the reference indicates that it's an immediate, I would trust that. – Jason R Jul 09 '15 at 14:48
1

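The problem is due to your function being public (i.e. callable by functions in other C/C++ modules), so the compiler has to generate a standalone version of it in which imm is not known. If you declare it as static (or inline), the compiler does not need to generate that standalone version, and you won't get the error as long as every call is inlined with a constant argument.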

Marat Dukhan