
Playing with AVX2 intrinsics for the first time (on a system that supports AVX2, but not AVX-512).

Neither from the prototype nor from the information in the Intel Intrinsics Guide would I have assumed that _mm256_loadu_epi64 and _mm256_storeu_epi64 are AVX-512 functions.

But if I compile the code with only -mavx2, I get compiler errors. If, on the other hand, I compile with -mavx512vl (as recommended by the error message), it compiles and seems to work. But of course I get nervous about what the compiler might do in the rest of the program if I opt for AVX-512...

Compiling as I think I should for my AVX2 machine:

clang++ -std=c++17 -O2 -mavx2 -o storeload dummy.cpp
dummy.cpp:16:21: error: always_inline function
'_mm256_loadu_epi64' requires target feature 'avx512vl',
but would be inlined into function 'main' that is
compiled without support for 'avx512vl'
__m256i avx2reg = _mm256_loadu_epi64(&input[0]);
^
dummy.cpp:17:3: error: always_inline function
'_mm256_storeu_epi64' requires target feature 'avx512vl',
but would be inlined into function 'main' that is
compiled without support for 'avx512vl'
_mm256_storeu_epi64(&output[0],avx2reg);
^
2 errors generated.

Compiles but makes me nervous:

clang++ -std=c++17 -O2 -mavx512vl -o storeload dummy.cpp

Seems to work:

./storeload
0x1111111111111111 == 0x1111111111111111 ?
0x2222222222222222 == 0x2222222222222222 ?
0x3333333333333333 == 0x3333333333333333 ?
0x4444444444444444 == 0x4444444444444444 ?

The compiler is

clang --version
Debian clang version 11.0.1-2
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

The test code is

#include <cstdint>
#include <array>
#include <cinttypes>
#include <iostream>
#include <immintrin.h>

int main(int argc, const char* argv[]) {
  std::array<uint64_t,4> input
    { UINT64_C(0x1111111111111111),
      UINT64_C(0x2222222222222222),
      UINT64_C(0x3333333333333333),
      UINT64_C(0x4444444444444444) };
  std::array<uint64_t,4> output;
  output.fill(UINT64_C(0));

  __m256i avx2reg = _mm256_loadu_epi64(&input[0]);
  _mm256_storeu_epi64(&output[0],avx2reg);

  std::cout << std::hex << std::showbase;
  
  for (size_t i=0; i < input.size(); i++) {
    std::cout << input[i] << " == " << output[i] << " ?" << std::endl;
  }
  
  return 0;
}

Questions

  • Is this a compiler bug, asking for AVX-512 when AVX2 should do?
  • How do I make sure the code (there is more code, not shown in this minimal example) will not crash on my AVX2 system if I do enable AVX-512?
  • Are there alternative functions I could/should use instead?
  • Are there alternative -m flags I should use and have not found yet?
    Just use [`_mm256_loadu_si256`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_loadu_si256&expand=3418) for an AVX(2) build. (The size/type of the elements is unimportant.) – Paul R Mar 18 '21 at 09:05
  • @PaulR Why is the prefix `_mm256` and not `_mm512` on the functions I use? – BitTickler Mar 18 '21 at 09:08
  • AVX512 adds new instructions which operate on 128 and 256 bit vectors as well as 512 bit vectors. The prefix indicates the width of the instruction operands, not the SIMD architecture. See the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3418&techs=AVX_512). – Paul R Mar 18 '21 at 09:19

1 Answer


Use _mm256_loadu_si256((const __m256i*) ptr) and _mm256_storeu_si256, and see also How to emulate _mm256_loadu_epi32 with gcc or clang?


Those intrinsics with nicer arg types (void* instead of __m256i*) were introduced with other AVX-512 intrinsics, but the most efficient way to do a 256-bit load is using AVX1 vmovdqu or vmovups (or a memory source operand for any instruction). That's why clang ends up making code that can run on your CPU. (Check the asm output with a disassembler or clang -march=native -O3 foo.cpp -S -o - | less)

It's unfortunate that clang doesn't even let you use the void* versions without enabling AVX-512VL, because they don't do anything that could only be implemented with AVX-512; only the masked versions of the vmovdqu64 intrinsics like _mm256_mask_storeu_epi64 really make sense, where the epi64 element size has any meaning (the masking granularity).

It's not safe to use -mavx512vl if your CPU doesn't support AVX-512VL (i.e. anything older than Skylake-X, Ice Lake, etc.). clang could have decided to actually use it, e.g. using ymm16..31 to avoid vzeroupper, or compiling a pair of bitwise boolean intrinsics into vpternlogd, or folding a _mm256_set1_epi32 into a broadcast memory source operand for vpaddd (_mm256_add_epi32).

Or, as a missed optimization (larger code-size), it could actually use vmovdqu64 instead of vmovdqu to load/store ymm0..15. GCC has/had this bug for a while.

Why is the prefix _mm256 and not _mm512 on the functions I use?

The whole point of AVX-512VL (VL=Vector Length) is 128 and 256-bit versions of cool new stuff that AVX-512 introduced, like masked stores and masked register-writes, twice as many vector registers, broadcast memory source operands instead of needing a separate vpbroadcastd load, etc.

  • Thanks a lot. Now I can do 4 64bit popcounts with avx2 (Mula's algorithm 2016) in parallel. I settled for the slightly ugly `_mm256_loadu_si256` (casts...) for now until I do some serious benchmarking. – BitTickler Mar 18 '21 at 12:12
  • @BitTickler: There's no "for now" about it - intrinsics code for AVX2 and earlier should still use the older intrinsics. I doubt many compilers are going to relax the target requirements for the `void*` versions, and they have silly misleading names like `epi64` that make it sound like a `vmovq` 8-byte load so you wouldn't want to use those intrinsics anyway. It would have been nice if compilers had simply changed `_mm256_loadu_si` to take a `void*` arg, but I guess that could have had some backward-compat issues for C++. – Peter Cordes Mar 18 '21 at 19:42
  • Right - at least on the Intel site you also linked to, they could add a remark to each function about the required -m settings. The filter menu on the left side does not seem to make that entirely and intuitively clear. – BitTickler Mar 18 '21 at 23:40
  • @BitTickler: IIRC, Intel's own compiler lets you use intrinsics without specifying `-m` at all, like MSVC. I don't know how it decides when it can optimize intrinsics into something more efficient, e.g. `_mm_and_epi32(x, _mm_or_epi32(y,z))` (SSE2) into `vpternlogd x,y,z, 0x??` (AVX-512VL), like presumably never to something you didn't specify in the target options. Or maybe it's just more faithful about not optimizing intrinsics; MSVC very rarely does (which is usually a poor model, although it's occasionally nice when gcc or clang end up making worse asm.) – Peter Cordes Mar 19 '21 at 02:15
  • @BitTickler: But Intel's documentation for `_mm256_loadu_epi64` is totally clear: it's an intrinsic for `vmovdqu64 ymm, m256`, and that instruction requires AVX512VL + AVX512F. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,Other&expand=5322,3383&text=_mm256_loadu_epi64 (Note that different compilers specify target options differently, although the major ones other than MSVC are fairly compatible for `-march=skylake-avx512` or whatever, which is what you should use to also set tuning options, not just `-mavx512f`.) – Peter Cordes Mar 19 '21 at 02:17