Compiling Intrinsics wrapper for generating platform specific code

Question

I have a generically built binary that needs to include a lookup routine which gets compiled into vectorized instructions or otherwise based upon whether the cpu supports avx/avx2.

The lookup routine is same as that explained here : Check all bytes of a __m128i for a match of a single byte using SSE/AVX/AVX2

Here the (_mm_set1_epi8, __mm_cmpeq_epi8,_mm_movemask_epi8) intrinsic set will compile into either vectorized instructions if avx/avx2 is supported by the cpu or just sse based instructions, otherwise.

in a oversimplified main.c : compiled without mavx/mavx2 and with -msse3 -msse4 -o 3

#define __SSE2__
#define SSE_Lookup() \      /*psuedo code*/
_mm_set1_epi8; \
__mm_cmpeq_epi8; \
match_bitmap=_mm_movemask_epi8
#endif

static inline __attribute__((always_inline))
uint64_t foo()
{
  unsigned int a=1,b,c,d;
  uint64_t match_bitmap;

  __cpuid(1,a,b,c,d);
  if(c & bit_AVX)
  {
       match_bitmap= avx_lookup();  
  }else
  {
  #if __SSE__
       SSE_Lookup();
  #endif
  }   
}

foo_avx.c

#include <emmintrin.h>

//mimicing an intrinsic wrapper
//don't want to create any new stack frames so keeping it inline

extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
__avx_lookup (char kk, __m128i h)
{
   __m128i k = _mm_set1_epi8(kk);
   __m128i r = _mm_cmpeq_epi8(k,h);
   return _mm_movemask_epi8(r);
}

compiled with x86_64_gcc-7.5.0_glibc/bin/x86_64-openwrt-linux-gnu-gcc

enwrt-linux-gnu/lib/Scrt1.o: in function _start': (.text+0x20): undefined reference to main' collect2: error: ld returned 1 exit status

Makefile:72: recipe for target '/build/x86_64/common/foo_avx.o' failed make[3]: *** [/build/x86_64/common/foo_avx.o] Error 1

So questions are :

Is the approach correct in defining the intrinsic wrapper that can be compiled with platform specific gcc options
Is there a better way of doing this ? goal is to have an executable with code for sse , avx as well as avx2 avx512 embedded that can be invoked based upon the cpu support at the run time.

Thanks in advance.

-J

Update: I also tried to add the __avx_lookup signature in a header file for other source files to see it. but that doesn't seem to be work.

You left out the useful part of your compiler error messages, only the summary from `make` itself at the end. A proper [mcve] would show those, like "inlining failed" because "target specific options mismatch" or something, if you used an intrinsic for an instruction-set you didn't enable on the command line, or for that function with `__attribute__((target("avx2")))` or whatever. Your code (including `__avx_lookup`) doesn't require AVX so should compile with baseline x86-64 (SSE2). BTW, don't use names with 2 leading underscores for your functions; names like that are reserved. — Peter Cordes, Feb 23 '23 at 00:29
Functions with different `target` options can't inline into each other, so you need to use the same target options for small helper functions. (IDK if GCC knows about subset / superset of target options, e.g. if it can inline an SSE2 function into an AVX function.) Dispatching once for the whole search, not branching or function pointers inside an inner loop. But you also need to do CPUID once at program startup and remember the results; `cpuid` itself is pretty slow and is serializing. (*Very* slow in a VM, it's a VM exit.) — Peter Cordes, Feb 23 '23 at 00:34
So i can compile with -msse3 -msse4 and then give __attribute__((target("avx2"))) to tell compiler that this part needs to be compiled with -mavx and -mavx2 ? That's interesting — Jay D, Feb 24 '23 at 00:58
The hierarchy i have is : parent inline function needs to invoke either avx_lookup() or sse_lookup() based upon whether CPU supports avx at run time. ? can these 2 *_lookup() be inline ? — Jay D, Feb 24 '23 at 01:15
yes the cpuid has been done at the beginning and the results cached in a variable that is being used everytime for run time avx decision. Interestingly the compiler doesn't give any errors about the inline __avx_lookup itself. — Jay D, Feb 24 '23 at 01:16
*Interestingly the compiler doesn't give any errors about the inline __avx_lookup itself.* - Like I said, that's because you don't use any AVX intrinsics in that function, only SSE2. If compiling with AVX enabled, the compiler will use `vpmovmskb eax, xmm0` or similar, but without AVX it can use `pmovmskb`. The `_mm_movemask_epi8` intrinsic only requires SSE2. So there's no point dispatching for AVX vs. SSE versions of *that* building block. Did you perhaps want to use `__m256i` variables with AVX2 `_mm256_cmpeq_epi8` and `_mm256_movemask_epi8`? — Peter Cordes, Feb 24 '23 at 01:25
yes there is another inline routine (or rather an intrinsic wrapper ) avx2_lookup that has 256bit avx2 vectorized lookup . for 128 bit i want to be able to choose avx Vs sse based upon cpu capabilities at the run time. if i understand your comment above , that's not feasible/possible ? mm_movemask_epi8 : with sse should compile to pmovmskb and with avx should compile to vpmovmskb, isn't it ? It's a vectorized move with avx — Jay D, Feb 25 '23 at 19:49
You can compile the same intrinsics with or without `-mavx` (or appropriate `target` attribute), it's just not interesting or surprising that there's no error without `-mavx`. The main advantage to AVX for this function is being able to fold an unaligned load into a memory source operand for `vpcmpeqb`, and to compare into a different destination register, not overwriting the `_mm_set1_epi8` result. But you want this tiny function to inline into a loop; it's too small on its own to be worth dispatching on. You don't want to have to redo the `vpshufb` or `vpbroadcastb` for `set1` every call. — Peter Cordes, Feb 25 '23 at 20:00

Compiling Intrinsics wrapper for generating platform specific code

0 Answers0