What are the errors in these AVX2 intrinsics, and how do I use the raw assembler?

Question

The Intel reference manual is fairly opaque on the details of how instructions are used. There are no examples on each instruction page.

https://software.intel.com/sites/default/files/4f/5b/36945

It's possible we haven't found the right page or the right manual yet. In any case, the syntax of GNU as might differ from what the Intel manual shows.

So we thought we could call the intrinsics, step into them in gdb to view the generated instructions, and from that learn the corresponding GNU as syntax. We also searched for documentation of the AVX2 instructions in GNU as but did not find any.

Here is the C++ code that does not compile:

#include <immintrin.h>
int main() {
  __m256i a = _mm256_set_epi32(1, 2, 3, 4, 5, 6, 7, 8); // seems to work
  double x = 1.0, y = 2.0, z = 3.0;
  __m256d c,d;
  __m256d b = _mm256_loadu_pd(&x);
  //  __m256d c = _mm256_loadu_pd(&y);
  //  __m256d d = _mm256_loadu_pd(&z);
  d = _mm256_fmadd_pd(b, c, d); // d = b * c + d
  _mm256_add_pd(b, c);
}

g++ -g

We would like to load a vector register such as %ymm0 with a single value, which the intrinsic _mm256_loadu_pd appears to support. Fused multiply-add and add are also available. Every intrinsic except the first gives an error:

/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:47:1: error: inlining failed in call to always_inline ‘__m256d _mm256_fmadd_pd(__m256d, __m256d, __m256d)’: target specific option mismatch
   47 | _mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
      | ^~~~~~~~~~~~~~~
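
For context, GCC generally reports "target specific option mismatch" when an intrinsic's instruction set is not enabled for the function that calls it, and -g alone enables neither AVX nor FMA. Adding -mavx -mfma to the g++ command line (or -march=native on a machine that has both) is the usual fix. As a minimal sketch of an alternative that builds without those flags, the function below enables both sets through the GCC/Clang target attribute. The names fma4 and cpu_has_avx_fma are hypothetical and not part of the question.

```cpp
#include <immintrin.h>

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i = 0..3.
// The target attribute (GCC/Clang extension) enables AVX and FMA for this
// function only, so the file builds even without -mavx -mfma.
__attribute__((target("avx,fma")))
void fma4(const double* a, const double* b, const double* c, double* out) {
    __m256d va = _mm256_loadu_pd(a);          // reads 4 doubles (32 bytes)
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);
    __m256d r  = _mm256_fmadd_pd(va, vb, vc); // va * vb + vc
    _mm256_storeu_pd(out, r);                 // writes 4 doubles
}

// Hypothetical runtime check so the AVX/FMA code is only executed on CPUs
// that support it.
bool cpu_has_avx_fma() {
    return __builtin_cpu_supports("avx") && __builtin_cpu_supports("fma");
}
```

A caller would check cpu_has_avx_fma() first and then pass pointers to arrays of at least four doubles. Passing pointers rather than __m256d values keeps vector types out of the function signature.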

Next, what is the syntax of the underlying assembler instructions? A pointer to a manual that documents them would be very helpful.

If you don't understand the general format of entries in Intel's ISA reference manual (which that 2011 AVX manual follows), see the x86 SDM vol.2. https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#three-volume, specifically the intro chapters. (The vol.2 manual includes all current instructions, including AVX / FMA). But that of course documents Intel syntax. AT&T syntax is just a matter of reversing the operand list and decorating register names with `%`, like `%ymm0` (https://stackoverflow.com/tags/att/info) — Peter Cordes, Jul 28 '20 at 22:03
`_mm256_loadu_pd` loads a whole 256bit vector, it should not be called on a single `double` (even though the type signature suggests that you can do that) — harold, Jul 28 '20 at 22:05
If you need tutorials on what instruction to use when, see https://stackoverflow.com/tags/sse/info and https://stackoverflow.com/tags/avx/info. (And Agner Fog's asm guide chapter on SSE.) Also note that FMA is a separate instruction set from AVX, but none of your intrinsics require AVX2, only AVX or AVX+FMA. Throwing random intrinsics you don't seem to understand at the compiler seems less likely to be helpful than compiling working examples from SO answers; many of them include Godbolt links, e.g. [Shuffling by mask with Intel AVX](https://stackoverflow.com/q/50098902) — Peter Cordes, Jul 28 '20 at 22:06
Or [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) has some hand-written examples in NASM syntax, and IIRC some C that GCC will auto-vectorize with FMA, with a Godbolt link in the question. Also [Issues of compiler generated assembly for intrinsics](https://stackoverflow.com/q/40416570) has some Godbolt links with FMA. — Peter Cordes, Jul 28 '20 at 22:10
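
Two points from the comments can be sketched together: filling every lane from a single double (instead of calling _mm256_loadu_pd on one double), and the AT&T operand order that GNU as expects. This is an illustrative sketch, not code from the thread. The name scale_add4 is hypothetical, and the target attribute and extended asm are GCC/Clang extensions.

```cpp
#include <immintrin.h>

// Hypothetical helper: out[i] = a * b[i] + c[i] for i = 0..3.
__attribute__((target("avx,fma")))
void scale_add4(double a, const double* b, const double* c, double* out) {
    __m256d va  = _mm256_set1_pd(a);    // copy one double into all four lanes
    __m256d vb  = _mm256_loadu_pd(b);   // load four doubles
    __m256d acc = _mm256_loadu_pd(c);   // accumulator starts as c

    // Intel syntax:  vfmadd231pd acc, va, vb    (acc = va * vb + acc)
    // AT&T syntax reverses the operand list, so the destination comes last.
    // With the "x" constraint the compiler picks the registers and prints
    // them with the % prefix, e.g. %ymm0.
    asm("vfmadd231pd %2, %1, %0"
        : "+x"(acc)            // %0: read and written
        : "x"(va), "x"(vb));   // %1, %2: inputs

    _mm256_storeu_pd(out, acc);
}
```

To see the instructions the compiler emits for the intrinsics themselves, g++ -S writes the assembly to a file, and objdump -d or gdb's disassemble command show it for a built binary. Both default to AT&T syntax on Linux, and gdb switches with set disassembly-flavor intel.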
