Clang/GCC Compiler Intrinsics without corresponding compiler flag

Question

I know there are similar questions, but compiling different files with different flags is not an acceptable solution here, since it would complicate the codebase very quickly. An answer of "No, it is not possible" will do.

Is it possible, in any version of Clang or GCC, to compile intrinsic functions for SSE2/SSE3/SSSE3/SSE4.1 while allowing the compiler to use only the SSE instruction set for its own optimizations?

EDIT: For example, I want the compiler to turn _mm_load_si128() into movdqa, but the compiler must not emit this instruction anywhere other than at this intrinsic call, similar to how the MSVC compiler works.

EDIT2: I have a dynamic dispatcher in place and several versions of each function, written with intrinsics for different instruction sets. Using multiple files would make this much harder to maintain, since the variants of the same code would span multiple files, and there are a lot of functions of this type.

EDIT3: Example source code as requested: https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/resample.cpp or most files in that folder, really.

This isn't entirely clear. You want to automatically compile SSE2+ intrinsics as SSE1? — Oliver Charlesworth, Jan 26 '14 at 12:43
How would the use of different compiler flags make the code more complex? — , Jan 26 '14 at 12:43
You probably need to tune the ISA used, e.g. with `-mtune=native` flag to `gcc`. Do you accept that? You may want to invest efforts in the builder (e.g. have a complex `Makefile` for a recent `make`) — Basile Starynkevitch, Jan 26 '14 at 12:44
@H2CO3: separating code with different instruction sets into different files will make the codebase much harder to maintain. — innocenat, Jan 26 '14 at 12:46
Why do you want to prohibit the use of some machine instruction outside of some builtin? Leave optimization freedom to the compiler! — Basile Starynkevitch, Jan 26 '14 at 12:46
@BasileStarynkevitch That would allow the compiler to emit SSE2/SSE3/SSSE3/SSE4.1/etc. in other places too, e.g. from autovectorisation — innocenat, Jan 26 '14 at 12:47
Yes, perhaps the compiler would autovectorize, and why is that unacceptable for you? — Basile Starynkevitch, Jan 26 '14 at 12:48
Please show real source code in your question. It is really confusing. — Basile Starynkevitch, Jan 26 '14 at 12:48
You might find the load-time [dynamic function resolver](http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html) feature to be useful - see the `ifunc` function attribute. It does depend on `glibc` and sufficient linker support however, so it's not exactly portable. — Brett Hale, Jan 29 '14 at 18:06

score 10 · Accepted Answer · answered Jan 27 '14 at 03:35

Here is an approach using gcc that might be acceptable. All source code goes into a single source file. The single source file is divided into sections. One section generates code according to the command line options used. Functions like main() and processor feature detection go in this section. Another section generates code according to a target override pragma. Intrinsic functions supported by the target override value can be used. Functions in this section should be called only after processor feature detection has confirmed the needed processor features are present. This example has a single override section for AVX2 code. Multiple override sections can be used when writing functions optimized for multiple targets.

// temporarily switch target so that all x64 intrinsic functions will be available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
#include <immintrin.h>   // <intrin.h> is not shipped with GCC on Linux; <immintrin.h> is
// restore the target selection
#pragma GCC pop_options

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------

int dummy1 (int a) {return a;}

//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// all x64 intrinsic functions are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
   {
   __m256i innerCarry, carryOut, rotate;

   innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry outs in bit 0 of each qword
   rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
   innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
   *data    = _mm256_slli_epi64 (*data, count);                               // shift all qwords left
   *data    = _mm256_or_si256 (*data, innerCarry);                            // propagate carrys from low qwords
   carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
   return carryOut;
   }

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------

int main (void)
    {
    return 0;
    }

//----------------------------------------------------------------------------
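
The example above leaves out the runtime check that guards the override section. The following is a minimal sketch of that missing piece, not the answer's own code. It uses the per-function `target` attribute instead of the pragma pair, and it assumes an x86 build with a reasonably recent GCC or Clang that accepts that attribute and provides `__builtin_cpu_supports`. The names `max_i32`, `max_i32_baseline` and `max_i32_sse41` are made up for illustration.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics such as _mm_max_epi32 */
#include <stdint.h>

/* Baseline path: compiled with whatever the command-line flags allow. */
static void max_i32_baseline(int32_t *dst, const int32_t *a,
                             const int32_t *b, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}

/* SSE4.1 path: the attribute raises the target for this function only.
   Assumption: the compiler supports __attribute__((target(...))). */
__attribute__((target("sse4.1")))
static void max_i32_sse41(int32_t *dst, const int32_t *a,
                          const int32_t *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_max_epi32(va, vb));
    }
    for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}

/* Hypothetical dispatcher: take the SSE4.1 path only if the CPU has it. */
void max_i32(int32_t *dst, const int32_t *a, const int32_t *b, int n)
{
    if (__builtin_cpu_supports("sse4.1"))
        max_i32_sse41(dst, a, b, n);
    else
        max_i32_baseline(dst, a, b, n);
}
```

Note that inside `max_i32_sse41` the optimizer is also free to use SSE4.1 for its own purposes, so this confines the instruction set per function rather than per intrinsic call as MSVC does.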

I will try that when I am back home. A quick question, though: does this also work with clang? And what version of GCC does this require? Thank you very much. — innocenat, Jan 27 '14 at 14:54
Hello @Nat, I don't know if this method works with clang. For gcc, it is supported by versions as old as 4.6.3 and possibly older (4.6.3 is the oldest I have handy for testing). — , Jan 27 '14 at 19:45
I have tried it, and this works magnificently. I am still interested in a clang solution, though, if anyone knows one. — innocenat, Jan 30 '14 at 01:35
Wow, this is a PITA. It seems like such an obvious use-case. Set minimum ISA to SSE2 (can be used anywhere without intrinsics) and then allow any intrinsics so you can create an SSE4.1 code path (for example) when it is available at runtime. — PatrickB, May 22 '14 at 20:54
@ScottD I can't make this work on https://gcc.godbolt.org/ Is the sample complete? — Bruno Martinez, Oct 25 '16 at 03:03

Mats Petersson · Answer 2 · 2014-01-26T14:39:47.650

There is no way to control the instruction set the compiler uses other than the switches on the compiler itself. In other words, there are no pragmas or other features for this, just the overall compiler flags.

This means that the only viable way to achieve what you want is to use -msseX and split your source into multiple files. (Of course, you can always use clever #include tricks to keep a single text file as the main source and include that same file in multiple places.)

Of course, the source code of the compiler is available. I'm sure the maintainers of GCC and Clang/LLVM will happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:

#pragma use_sse=1
void func()
{
   ... some code goes here ... 
}

#pragma use_sse=3
void func2()
{
  ...
  func();
  ...
}

Now, func is short enough to be inlined. Should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for func()?

I understand that YOU may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with this in some way.
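
For reference, GCC's per-function `target` attribute gives this hypothetical a concrete form. The GCC documentation describes the inlining rule it applies on x86: a callee is inlined only if its target options are a subset of the caller's. The sketch below assumes an x86 build with a reasonably recent GCC (Clang should accept the same attribute). `add_pd` and `sum_pairs` are illustrative names, not library functions.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd, _mm_loadu_pd */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_pd */

/* Callee restricted to SSE2. */
__attribute__((target("sse2")))
static inline __m128d add_pd(__m128d a, __m128d b)
{
    return _mm_add_pd(a, b);
}

/* Caller allowed to use SSE3. Since SSE3 implies SSE2, the callee's
   options are a subset of the caller's, so inlining add_pd here is
   permitted; the reverse direction would not be. */
__attribute__((target("sse3")))
double sum_pairs(const double *a, const double *b)
{
    __m128d s = add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)); /* {a0+b0, a1+b1} */
    s = _mm_hadd_pd(s, s);                                /* SSE3 horizontal add */
    return _mm_cvtsd_f64(s);                              /* a0+b0+a1+b1 */
}
```

Under that rule, the inlined body is compiled with the caller's options, so any code inlined into `sum_pairs` may use SSE3.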

Edit: In the headerfiles declaring the SSE intrinsics (and many other intrinsics), a typical function looks something like this:

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}

The __builtin_ia32_addss builtin is only available in the compiler when you have enabled the -msse option. So if you convince the compiler to still let you use _mm_add_ss() when you have -mno-sse, it will give you an error for "__builtin_ia32_addss is not declared in this scope" (I just tried).

It would probably not be very hard to change this particular behaviour; there are probably only a few places where the code introduces builtin functions. However, I'm not convinced that there aren't further issues later on, when the compiler actually comes to issuing instructions.

I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately, there are several steps involved in getting from the "parser" to the "code generation", where the builtin function gets involved.

Edit2:

Compared to GCC, solving this for Clang is even more complex, in that the compiler itself has understanding of SSE instructions, so it simply has this in the header file:

static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
  return __a + __b;
}

The compiler will then know that to add a couple of __m128, it needs to produce the correct SSE instruction. I have just downloaded Clang (I'm at home, my work on Clang is at work, and not related to SSE at all, just builtin functions in general - and I haven't really done much of the changes to Clang as such, but it was enough to understand roughly how builtin functions work).

However, from your perspective, the fact that it's not a builtin function makes things worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just turns it into "add these two things" and passes it to LLVM for further work; LLVM is the part that understands SSE instructions. The fact that this is an "intrinsic function" is now pretty much lost, and the compiler treats it just as if you'd written a + b, with a and b happening to be 128 bits long. That makes it even more complicated to generate "the right instructions" while keeping "all other" instructions at a different SSE level.

What amazes me is that MSVC and Intel Compiler perfectly accept this kind of usage, without even needing #pragma — innocenat, Jan 26 '14 at 13:02
Fine, use them then... I have a feeling that it's the other way around, actually, that they don't support autovectorization very well, so won't automatically generate SSE instructions "in the main code". — Mats Petersson, Jan 26 '14 at 13:08
The problem is that I want to port the Windows code using MSVC to Linux. And it's that way, I even have /arch:IA-32 specified. — innocenat, Jan 26 '14 at 13:22
So, just because you like it to be some particular way doesn't change the way that reality is. MS makes their compiler via their team, Intel has its compiler team, Clang and GCC are open source projects, which means anyone can contribute, but it's largely driven by a small number of contributors that either do it as a side-line along with their main work, as research work at a university, or as part of some "We as a company like you to work on this compiler stuff, because it helps us sell chips if the compiler can do this". So if you don't like it, talk to the people that make it. — Mats Petersson, Jan 26 '14 at 13:26
I only post first comment because you make it sound like it is not feasible to implement. And my second comment is in direct argument to your second comment. And when did I say I dont understand reality is? — innocenat, Jan 26 '14 at 13:32
Thank you. I actually know how the header file in gcc work, as I have tried it already. It's good to have insight into Clang though. — innocenat, Jan 26 '14 at 14:20
@Nat: I have added further info regarding clang. Unfortunately, not exactly helping matters... — Mats Petersson, Jan 26 '14 at 14:40
