
Suppose we have a CPU with AVX-512BW support, and I want to run three kinds of SIMD string-length functions on it.

  • The first function loads 16 bytes of the string at a time (AVX) and searches them for the null terminator, repeating until a null terminator is found.
  • The second function loads 32 bytes of the string at a time (AVX2) and searches them for the null terminator, repeating until a null terminator is found.
  • The third function loads 64 bytes of the string at a time (AVX-512BW) and searches them for the null terminator, repeating until a null terminator is found.
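
To make the three variants concrete, here is a minimal sketch in C with intrinsics, not tuned code. The names `strlen_16`, `strlen_32`, `strlen_64` and `simd_strlen` are made up for illustration. It assumes GCC or Clang on x86-64 (for the `target` attribute and the `__builtin_*` helpers), and it uses the usual aligned-load trick: the first load starts at the aligned address at or below the string, so it may read a few bytes before the string within the same aligned block, which never crosses a page but is outside what ISO C strictly guarantees.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16 bytes per step. With AVX enabled, the compiler uses VEX-encoded
   instructions (such as vpcmpeqb / vpmovmskb) for these intrinsics. */
__attribute__((target("avx")))
static size_t strlen_16(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= ~0u << (s - p);          /* ignore bytes before the string */
    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    return (size_t)((p - s) + __builtin_ctz(mask));
}

/* 32 bytes per step, AVX2 (still VEX-encoded). */
__attribute__((target("avx2")))
static size_t strlen_32(const char *s)
{
    const __m256i zero = _mm256_setzero_si256();
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)31);
    unsigned mask = (unsigned)_mm256_movemask_epi8(
        _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i *)p), zero));
    mask &= ~0u << (s - p);
    while (mask == 0) {
        p += 32;
        mask = (unsigned)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(_mm256_load_si256((const __m256i *)p), zero));
    }
    return (size_t)((p - s) + __builtin_ctz(mask));
}

/* 64 bytes per step, AVX-512BW: the compare writes a 64-bit mask register. */
__attribute__((target("avx512bw")))
static size_t strlen_64(const char *s)
{
    const __m512i zero = _mm512_setzero_si512();
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)63);
    unsigned long long mask =
        _mm512_cmpeq_epi8_mask(_mm512_load_si512((const void *)p), zero);
    mask &= ~0ULL << (s - p);
    while (mask == 0) {
        p += 64;
        mask = _mm512_cmpeq_epi8_mask(_mm512_load_si512((const void *)p), zero);
    }
    return (size_t)((p - s) + __builtin_ctzll(mask));
}

/* Hypothetical dispatcher: use the widest version the CPU and OS support. */
size_t simd_strlen(const char *s)
{
    if (__builtin_cpu_supports("avx512bw")) return strlen_64(s);
    if (__builtin_cpu_supports("avx2"))     return strlen_32(s);
    if (__builtin_cpu_supports("avx"))      return strlen_16(s);
    return strlen(s);
}
```

All three functions live in the same translation unit, so on a CPU with AVX-512BW each of them can be called directly.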

What I don't understand is whether, on an AVX-512 CPU, all three functions must use AVX-512 instructions, or whether each can use the instructions of its own SIMD extension. For example, should the first function use `vmovdqa` or `vmovdqa16`? Should the second function use `vmovdqa` or `vmovdqa32`?

Why do instructions such as `vmovdqa16`, `vmovdqa32`, etc. exist when we can just use the AVX or AVX2 instructions?

Is it possible to use AVX or AVX2 instructions in an AVX-512 function, or must we use the AVX-512 versions of those instructions?

Peter Cordes
HelloGUI
  • *Why do instructions such as vmovdqa16, vmovdqa32, etc. exist when we can just use the AVX or AVX2 instructions?* - So you can do per-element masking, and so you can access X/YMM16-31. If you don't want that, use `vmovdqa` for 16 or 32-byte vectors. There's zero problem using the shorter VEX encodings for 16 or 32-byte operand-size and then using AVX-512 instructions on the load results from `vmovdqa` or `vmovdqu`, either at that same vector width (AVX-512VL) or at 512-bit. You can see this if you compile code with intrinsics using a mix of vector widths. – Peter Cordes Jul 31 '23 at 19:49
  • `vmovdqa16` -- does this exist? – harold Jul 31 '23 at 19:50
  • @harold: Actually no, for 8 and 16-bit element size, only alignment-not-required `vmovdqu8/16` exist. There are `vmovdqa32/64`, though. – Peter Cordes Jul 31 '23 at 19:52
  • Related: [What is the penalty of mixing EVEX and VEX encoded scheme?](https://stackoverflow.com/q/46080327) - no penalty. And it mentions an example like AVX1 `vpxor xmm0,xmm0,xmm0` being the most efficient way to zero a ZMM register. – Peter Cordes Jul 31 '23 at 19:53
  • AVX-512 CPUs can run SSE and AVX1 / AVX2 instructions. Otherwise they couldn't run existing binaries. It should be obvious that the first 2 cases can use AVX `vmovdqa` exactly like previous CPUs; backwards compatibility has historically been x86's main selling point. – Peter Cordes Jul 31 '23 at 19:59
  • Near duplicate: [What is the difference between \_mm512\_load\_epi32 and \_mm512\_load\_si512?](https://stackoverflow.com/q/53905757) explains the existence of `vmovdqa32` and `vmovdqa64` (masking), as well as the naming of intrinsics for them. Also semi-related, [How to load a avx-512 zmm register from a ioremap() address?](https://stackoverflow.com/q/60699914) showing that you need `vmovdqa32` (or 64) to load ZMM registers. (But doesn't address the question of using `vmovdqa ymm` on CPUs with AVX-512 support). – Peter Cordes Jul 31 '23 at 22:50
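
The points made in the comments about mixing encodings and about per-element masking can be sketched in C with intrinsics. This is only an illustration under assumptions: the function names are hypothetical, it needs GCC or Clang on x86-64, and the instruction names in the code comments are what a compiler would typically emit, not a guarantee (it is free to fold a load into the compare or to pick a different encoding). The first function does a plain 32-byte load and then an AVX-512BW+VL compare into a mask register. The second uses a masked byte load, the kind of operation the element-sized `vmovdqu8` form exists for: bytes whose mask bit is clear are not read and cannot fault.

```c
#include <immintrin.h>
#include <stddef.h>

/* Plain 32-byte load (typically VEX vmovdqu ymm), followed by an
   AVX-512BW+VL compare that produces a 32-bit mask (EVEX vpcmpeqb k, ymm, ymm).
   Reads all 32 bytes at s. Returns the index of the first zero byte, or 32. */
__attribute__((target("avx512bw,avx512vl")))
static size_t first_zero_in_32(const char *s)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)s);
    __mmask32 m = _mm256_cmpeq_epi8_mask(v, _mm256_setzero_si256());
    return m ? (size_t)__builtin_ctz(m) : 32;
}

/* Masked load (vmovdqu8 ymm{k}{z}, [mem]): only the first n bytes (n < 32)
   are read. Returns the index of the first zero byte among them, or n. */
__attribute__((target("avx512bw,avx512vl")))
static size_t first_zero_masked(const char *s, size_t n)
{
    __mmask32 keep = (1u << n) - 1u;              /* low n bits set */
    __m256i v = _mm256_maskz_loadu_epi8(keep, s);
    __mmask32 m = _mm256_mask_cmpeq_epi8_mask(keep, v, _mm256_setzero_si256());
    return m ? (size_t)__builtin_ctz(m) : n;
}

/* Hypothetical wrapper: looks at no more than min(n, 32) bytes and falls
   back to a scalar loop when AVX-512BW/VL is not available. */
size_t first_zero(const char *s, size_t n)
{
    if (n > 32) n = 32;
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl"))
        return (n == 32) ? first_zero_in_32(s) : first_zero_masked(s, n);
    size_t i = 0;
    while (i < n && s[i] != '\0') ++i;
    return i;
}
```

Inspecting the compiler's assembly output for a file like this is one way to see VEX-encoded and EVEX-encoded instructions side by side in the same function.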

0 Answers