
I'm trying to create a minimal reproducer for this issue report. There seem to be some problems with AVX-512, which is shipping on the latest Apple machines with Skylake processors.

According to the GCC 6 release notes, the AVX-512 gear should be available. According to the Intel Intrinsics Guide, vmovdqu64 is available with AVX-512VL and AVX-512F:

$ cat test.cxx
#include <cstdint>
#include <immintrin.h>
int main(int argc, char* argv[])
{
    uint64_t x[8];
    __m512i y = _mm512_loadu_epi64(x);
    return 0;
}

And then:

$ /opt/local/bin/g++-mp-6 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
     __m512i y = _mm512_loadu_epi64(x);
                                     ^
$ /opt/local/bin/g++-mp-6 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
     __m512i y = _mm512_loadu_epi64(x);
                                     ^
$ /opt/local/bin/g++-mp-6 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
     __m512i y = _mm512_loadu_epi64(x);
                                     ^

I walked the options back to -msse2 without success. I seem to be missing something.

What is required to engage AVX-512 for modern GCC?


According to a /opt/local/bin/g++-mp-6 -v, these are the header search paths:

#include "..." search starts here:
#include <...> search starts here:
 /opt/local/include/gcc6/c++/
 /opt/local/include/gcc6/c++//x86_64-apple-darwin13
 /opt/local/include/gcc6/c++//backward
 /opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include
 /opt/local/include
 /opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include-fixed
 /usr/include
 /System/Library/Frameworks
 /Library/Frameworks

And then:

$ grep -R '_mm512_' /opt/local/lib/gcc6/ | grep avx512f | head -n 8
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi64 (long long __A, long long __B, long long __C,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi32 (int __A, int __B, int __C, int __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_pd (double __A, double __B, double __C, double __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_ps (float __A, float __B, float __C, float __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi64(e0,e1,e2,e3,e4,e5,e6,e7)                       \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:  _mm512_set_epi64(e7,e6,e5,e4,e3,e2,e1,e0)
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi32(e0,e1,e2,e3,e4,e5,e6,e7,                       \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:  _mm512_set_epi32(e15,e14,e13,e12,e11,e10,e9,e8,e7,e6,e5,e4,e3,e2,e1,e0)
...
jww
    Works on [clang with `-mavx512f`](https://godbolt.org/z/hTKWcI). Try to search through the headers for the related intrinsics and see if you can find the function. If you can't find it you either need a newer version of gcc or chances are it just hasn't been implemented yet. –  Dec 04 '18 at 02:52
  • With no masking, there's no reason for this intrinsic to exist or to ever use it. It's just confusing vs. `_mm512_loadu_si512`. Intel's guide does specify that it exists (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,SVML,Other&expand=2403,4147&text=_mm512_loadu_), but maybe gcc doesn't define it. SSE* and AVX1/2 options are irrelevant to whether or not GCC headers define this intrinsic in terms of gcc built-ins; `-mavx512f` already implies all of the Intel SSE/AVX extensions before AVX512. – Peter Cordes Dec 04 '18 at 03:05
  • @user2176127 Not present on gcc trunk on Godbolt, and in clang it's only present in trunk, not 7.0. The aligned version is present in both, though: `_mm512_load_epi64`. And they both have the `maskz` version. Pretty weird, but like I said super easy to avoid by just using the si512 version. – Peter Cordes Dec 04 '18 at 03:13
  • Yeah, grepping the GCC include directories is not returning any hits for `$ grep -R '_mm512_' /opt/local/lib/gcc6/ | grep _mm512_loadu_epi64`. – jww Dec 04 '18 at 03:16
  • @Peter - I think `_mm512_load_epi64` is close enough. I can get the assembler error message using it. (I am trying to reproduce the integrated assembler error). – jww Dec 04 '18 at 03:18

2 Answers


With no masking, there's no reason for this intrinsic to exist or to ever use it instead of the equivalent _mm512_loadu_si512. It's just confusing, and could trick human readers into thinking it was a vmovq zero-extending load of a single epi64.

Intel's intrinsics finder does specify that it exists, but even current trunk gcc (on Godbolt) doesn't define it.

Almost all AVX512 instructions support merge-masking and zero-masking. Instructions that used to be purely bitwise / whole-register with no meaningful element boundaries now come in 32- and 64-bit element flavours, like vpxord and vpxorq, or vmovdqa32 and vmovdqa64. But using either version with no masking is still just a normal vector load / store / register-copy, and it's not meaningful to specify anything about element size for them in the C++ source with intrinsics, only the total vector width.

See also What is the difference between _mm512_load_epi32 and _mm512_load_si512?


SSE* and AVX1/2 options are irrelevant to whether or not GCC headers define this intrinsic in terms of gcc built-ins; -mavx512f already implies all of the Intel SSE/AVX extensions before AVX512.


It is present in clang trunk (but not 7.0, so it was only very recently added).

  • unaligned _mm512_loadu_si512 - supported everywhere, use this
  • unaligned _mm512_loadu_epi64 - clang trunk, not gcc.
  • aligned _mm512_load_si512 - supported everywhere, use this
  • aligned _mm512_load_epi64 - also supported everywhere, surprisingly.
  • unaligned _mm512_maskz_loadu_epi64 - supported everywhere, use this for zero-masked loads
  • unaligned _mm512_mask_loadu_epi64 - supported everywhere, use this for merge-mask loads.

This code compiles on gcc as early as 4.9.0, and mainline (Linux) clang as early as 3.9, both with -mavx512f. Or, if they support it, -march=skylake-avx512 or -march=knl. I haven't tested with Apple Clang.

#include <immintrin.h>

__m512i loadu_si512(const void *x) { return _mm512_loadu_si512(x); }
__m512i load_epi64(const void *x)  { return _mm512_load_epi64(x); }
//__m512i loadu_epi64(const void *x) { return _mm512_loadu_epi64(x); }  // gcc: not declared

__m512i loadu_maskz(const void *x) { return _mm512_maskz_loadu_epi64(0xf0, x); }
__m512i loadu_mask(const void *x)  { return _mm512_mask_loadu_epi64(_mm512_setzero_si512(), 0xf0, x); }

Godbolt link; you can uncomment the _mm512_loadu_epi64 and flip the compiler to clang trunk to see it work there.

Peter Cordes
  • Forgive my ignorance... When you say *"with no masking"*, what do you mean? – jww Dec 04 '18 at 03:28
  • Without an AVX512 mask argument. The whole point of having different element-size versions of vector loads is for use with a mask register, like `_mm512_maskz_loadu_epi64(0x55, x);` that zeros the odd elements for free while loading. Like `vmovdqu64 (%rdi), %zmm0{%k1}{z}`. When elements are loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have them for bitwise booleans (like `_mm_xor_si128`) and loads/stores (like `_mm_load_si128`). – Peter Cordes Dec 04 '18 at 03:32
  • And it's why asm instructions like `vmovdqu64/32/16/8` exist. http://felixcloutier.com/x86/MOVDQU:VMOVDQU8:VMOVDQU16:VMOVDQU32:VMOVDQU64.html Notice the `{%k1}{z}` decoration in the above AT&T syntax asm instruction. That means to assemble with the EVEX prefix encoding mask-register = k1, and zero-masking (not merge-masking). For most instructions, the no-masking encoding is what would have meant `k0`, so there's an extra mask register you can use as a destination for compare-into-mask or `kshift` / `kunpack` / `ktest` / `kortest` / etc. instructions. – Peter Cordes Dec 04 '18 at 03:33
  • Thanks. I was not aware a mask was needed to load a register with a value from memory under AVX-512. – jww Dec 04 '18 at 03:36
  • @jww: it isn't needed. (just finished editing my last comment to mention that the encoding that would mean `k0` actually means "no mask" in those contexts). So you could say that there's always masking in machine code, but there's an architectural zero-register that you use for no-masking. But in asm text syntax, and C++ intrinsics, you can use the no-masking version of an instruction or intrinsic. If you aren't using a mask in an intrinsic that's purely bitwise, there's no need to care about different element-size versions of it. In asm you just need to pick one; in C++ the compiler can pick for you. – Peter Cordes Dec 04 '18 at 03:39
  • Here's the other piece of the puzzle... When using the Clang integrated assembler, [Clang 7.0 is required](https://bugs.llvm.org/show_bug.cgi?id=39875#c4) due to an LLVM issue that was fixed in February 2018. Also see LLVM Issue 39875, [Error: instruction requires: AVX-512 ISA when using -mavx512](https://bugs.llvm.org/show_bug.cgi?id=39875). – jww Dec 04 '18 at 05:15
  • @jww: You had a similar link in the question. Seems unrelated to choice of intrinsic; if that didn't assemble then presumably no AVX512F instructions would, because of some front-end bug leading to the assembler deciding not to accept AVX512F instructions. So you can change the test-case to use a portable intrinsic. – Peter Cordes Dec 04 '18 at 05:32

_mm512_loadu_epi64 is not available in 32-bit mode. You need to compile for 64-bit mode. In general, AVX512 works best in 64-bit mode.

A Fog
  • Which compiler is that true for? That sounds like a bug because `vmovdqu64` is available in either mode. (Agreed on using 64-bit mode so you have 32 instead of 8 registers, though! And so integer registers can hold a 64-element mask for AVX512BW.) – Peter Cordes Jun 17 '19 at 23:13