
I'm a real newbie with intrinsics, SIMD, and low-level programming in general. I'm taking my first steps, but from what I can see, all the intrinsics I'm using (Intel's, right now) are just generic C++ code: nothing "special", no dedicated keywords.

It seems to be a sort of agreement between that "list of functions" and the compiler: for example, it's like telling the compiler that if I use something like this:

__m128d vA = _mm_load_pd(a);

it should treat the vA variable as an XMM register instead of allocating it in memory. But that's not guaranteed (since __m128d is, in the end, a C++ union/struct object, which could reside in memory).

Am I right? Or is there further black magic under the hood?

And how can the compiler treat those functions specially instead of as generic functions? Rules matched while parsing the code? Something like that?

It's extremely fascinating for a web developer :)

markzzz
  • It would be a bug for *that compiler* to do something different from simply emitting the specified instructions there. I wouldn't call that a "suggestion". – Caleth Dec 13 '18 at 13:30
  • Like I said in an edit to an answer on your previous question ([What is \_\_m128d?](https://stackoverflow.com/q/53757633)), a `__m128d` object is not fundamentally different from an `int` object in C or C++. All variables logically have an address, but the memory storage can be optimized away if the address of the variable isn't taken. (Or if it's taken but not used in a way that could let other code do anything with it.) Look at compiler asm output on https://godbolt.org/ (see also [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116)). – Peter Cordes Dec 13 '18 at 13:38
  • Let's say I use MSVC. But those intrinsics were made by Intel, not by Microsoft. So I believe that first they released the library, then all compilers "adapted" to them? – markzzz Dec 13 '18 at 13:39
  • There was never a library, AFAIK. Compilers either implemented them as intrinsics (or possibly with inline asm, but that would defeat optimizations like constant-propagation and CSE), or they didn't implement them at all. It wouldn't be worthwhile to implement Intel's intrinsics API as a library of non-inline functions that had `call`/`ret` overhead and passed `__m128d` objects as ordinary aggregate types (`struct`/`union`) on a compiler that couldn't generate SSE instructions. – Peter Cordes Dec 13 '18 at 13:43
  • @PeterCordes so they are not "generic" functions that contain code that will be "optimized" into SIMD calls. They are a sort of "transparent" mechanism, with a proper compiler implementation of each (which will differ between compilers). i.e. Intel didn't release an emmintrin.cpp with definitions of those functions :) – markzzz Dec 13 '18 at 14:19
  • Think of intrinsics more like operators. The compiler understands how they work. GNU C literally allows you to write `a+b` on vector types instead of `_mm_add_pd(a,b)`, and the intrinsic is actually defined as a simple inline function that uses that operator. The compiler knows what addition is, and e.g. knows that adding zero is a no-op and can be optimized away (at least with `-ffast-math`, because strict FP is tricky). Things other than add still behave the same way as far as the optimizer is concerned, so the compiler can optimize intrinsics (e.g. clang has a good shuffle optimizer). – Peter Cordes Dec 13 '18 at 14:29
  • @PeterCordes putting it that way, it seems the compiler is so smart that intrinsics aren't needed at all! But it's not that way: if I set MSVC to fast optimization and write the usual code instead of using intrinsics, it seems really slow. So they must add "something" more than "general" + operations... – markzzz Dec 13 '18 at 14:38
  • Compilers can't always auto-vectorize scalar code. Intrinsics allow you to express manually-vectorized algorithms in C or C++. Just like `+` on `float` operands compiles differently from `+` on `int` operands, `_mm_add_pd` on `__m128d` operands does two packed `double` additions on the two halves of the `__m128d` vector object. (Typically using an `addpd` instruction, unless it optimizes into something else.) – Peter Cordes Dec 13 '18 at 14:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185215/discussion-between-markzzz-and-peter-cordes). – markzzz Dec 13 '18 at 14:50

1 Answer


You are really asking two different questions:

(1) How does the compiler decide where to put my SIMD variables? In memory or in register?

(2) How specific is the 'contract' for an intrinsic? Does it always emit a specific instruction?

The answer to the first question is really no different for SIMD than for any other kind of variable: in C/C++, you usually use automatic variables, because those are the most likely to end up in a register. The compiler is free to schedule the actual instructions and register usage based on the context, and will often spill data from registers to 'stack memory' and reload it depending on how much 'register pressure' there is in the code.

This flexibility is a "good thing" compared to writing it in assembly, where you the programmer decide exactly which registers are used when, and exactly what order the instructions execute in. Often the compiler can mix in other nearby code or do other optimizations that are difficult to keep straight by hand, and it can take advantage of architecture differences. For example, in DirectXMath I have written the same intrinsics code for both x86 (32-bit) and x64 (64-bit), and the compiler can make use of the 8 extra registers available in x64. If I were using inline assembly, I'd have to write it two different ways, and probably more than that, with some additional differences I'll come to shortly.

When writing SIMD code, you really want to maximize the work done on data already in registers, because the load/store overhead to memory often costs as much performance as you gain from doing a few SIMD instructions instead of scalar ones. As such, you will usually write SIMD intrinsics to do an explicit load into a bunch of 'automatic variables', keeping in mind that likely only 8 or so of them are really going to be in registers at a time. You do want to do enough work that the compiler can fill in the gaps. You then store the result to memory. As such, you really don't do stuff like `auto a = new __m128d;`. There's also the additional complexity of the implied alignment: `__m128d` must be 16-byte aligned, and while x64 `new` guarantees that, x86 `new` does not.

The second answer is a bit more complicated. A given intrinsic is usually defined as a particular instruction, and some intrinsics are really combinations of instructions, but the compiler may use some knowledge of the target platform when picking the exact instruction. Here are a few examples:

  • `__m128 _mm_add_ps (__m128 a, __m128 b)` is defined as the SSE instruction `addps` and is often emitted as such. But if you are building with /arch:AVX or /arch:AVX2, the compiler will use the VEX prefix and the instruction `vaddps`.

  • `__m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c)` is defined as an FMA3 instruction, but the compiler can actually emit `vfmadd132pd`, `vfmadd213pd`, or `vfmadd231pd` depending on the exact register use. In fact, the compiler can even decide it's faster to use a `vmulpd` followed by a `vaddpd` (which computes nearly the same thing, aside from rounding of the intermediate result; see the comments), depending on the instruction timings in the hardware cost model it is using.

Note that it would certainly be possible for a compiler implementer to decide, say, that `__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)` with the same register for `a` and `b` could be optimized to a `vpermilps` instead of a `shufps` when building with /arch:AVX; that would still be 'in contract' with the intrinsic. In practice, however, intrinsics tend to be treated a bit specially and strongly prefer the instruction they are defined as, because you often use them in particular contexts based on hardware feature detection. So you can normally count on a particular intrinsic ending up as the instruction you expect, or a very close variant of it.

So in short, all of C/C++ is a 'hint' to the compiler in the sense that the source code describes the exact computation you want, but the compiler is free to emit code that achieves the same result in a different order, or with different instructions, than the ones you might assume.

The Intel Intrinsics Guide is a good resource for exploring intrinsics.

You might also find some of my blog posts related to intrinsics useful.

The DirectXMath Programmer's Guide also has some useful tips & tricks for intrinsics usage sprinkled throughout, so it's worth a read; it's only 6 pages, so it won't take that long. See Microsoft Docs.

Chuck Walbourn
  • `vfmadd*pd` is not the same as a `vmulpd` followed by a `vaddpd`, since the first will not round the intermediate result. But other than that, of course the 'as if' rule still applies: If a compiler finds a way to achieve the same as-if executing what the programmer wrote, it can emit the corresponding code. – chtz Dec 14 '18 at 23:42
  • ICC and MSVC strongly prefer to emit the instruction matching the intrinsic. GCC mostly prefers. Clang/LLVM doesn't give a crap and always uses its shuffle optimizer, and will do stuff like combining a chain of integer `add`s into a left-shift. – Peter Cordes Dec 15 '18 at 08:13