
I am studying AVX-512. I have a question about VORPS.

The documentation describes it like this:

EVEX.512.0F.W0 56 /r VORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst

Return the bitwise logical OR of packed single-precision floating-point values in zmm2 and zmm3/m512/m32bcst subject to writemask k1.

EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Ref: https://www.felixcloutier.com/x86/orps


What does "subject to writemask k1" mean?

Can anyone give a concrete example of how k1 affects the result of this instruction?

I wrote this code to experiment with VORPS: https://godbolt.org/z/fMcqoa

Code

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

int main()
{
  register uint8_t *st_data asm("rbx");
  asm volatile(
    // Fix stack alignment
    "andq   $~0x3f, %%rsp\n\t"

    // Allocate stack
    "subq   $0x100, %%rsp\n\t"

    // Take stack pointer, save it to st_data
    "movq   %%rsp, %[st_data]\n\t"

    // Fill 64 bytes top of stack with 0x01
    "movq   %%rsp, %%rdi\n\t"
    "movl   $0x40, %%ecx\n\t"
    "movl   $0x1, %%eax\n\t"
    "rep    stosb\n\t"

    // Fill 64 bytes next with 0x02
    "incl   %%eax\n\t"
    "leaq   0x40(%%rsp), %%rdi\n\t"
    "movl   $0x40, %%ecx\n\t"
    "rep    stosb\n\t"

    // Take 0x1 and 0x2 to ZMM register
    "vmovdqa64  (%%rsp), %%zmm0\n\t"
    "vmovdqa64  0x40(%%rsp), %%zmm1\n\t"

    // Set write mask
    "movq   $0x123456, %%rax\n\t"
    "kmovq  %%rax, %%k0\n\t"
    "kmovq  %%rax, %%k1\n\t"
    "kmovq  %%rax, %%k2\n\t"

    // Execute vorps, store the result to ZMM2
    "vorps  %%zmm0, %%zmm1, %%zmm2\n\t"

    // Plug back the result to memory
    "vmovdqa64  %%zmm2, 0x80(%%rsp)\n\t"
    "vzeroupper"
    : [st_data]"=r"(st_data)
    :
    : "rax", "rcx", "rdi", "zmm0", "zmm1",
      "zmm2", "memory", "cc"
  );

  static const char *x[] = {
    "Data 1:", "Data 2:", "Result:"
  };

  for (size_t i = 0; i < 3; i++) {
    printf("%s\n", x[i]);
    for (size_t j = 0; j < 8; j++) {
      for (size_t k = 0; k < 8; k++) {
        printf("%02x ", *st_data++);
      }
      printf("\n");
    }
    printf("\n");
  }

  fflush(stdout);

  asm volatile(
    // sys_exit
    "movl   $0x3c, %eax\n\t"
    "xorl   %edi, %edi\n\t"
    "syscall"
  );
}

I tried changing the values of k0, k1, and k2, but the result is always the same.

Result:
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03
  • You must put the code in the question (you can have a godbolt link **as well as** the code), and this is not a C question because you are asking about assembly code. – user253751 Feb 24 '21 at 17:57
  • Also, it seems that you have to tell it to use a k register (and which one); it doesn't automatically use one. Where is the part where you tell it to use a k register for the VORPS instruction? – user253751 Feb 24 '21 at 17:59
  • You have to specify the mask `%{%k1}%{z}` explicitly with the dest reg to get the masked form. – Hadi Brais Feb 24 '21 at 19:18
  • There are intrinsics for these instructions which you can and should use instead of inline asm, like `_mm512_maskz_or_ps` (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,AVX_512,Other&expand=5322,4068&text=vorps). (If you want to know the AT&T syntax for using masking, use intrinsics and look at the compiler output.) – Peter Cordes Feb 25 '21 at 01:28
  • Also, your inline asm modifies RSP, so it's unusable as part of a normal program. It only just barely happens to work because you don't try to return, and you don't use many local vars. To make it safe, you'd do `alignas(64) uint8_t outbuf[0x100]` to get the compiler to align and reserve space for you. Then use that as a pointer input operand (with a memory clobber), or as a memory output operand (the whole array). [Looping over arrays with inline assembly](https://stackoverflow.com/q/34244185) – Peter Cordes Feb 25 '21 at 04:09
  • @PeterCordes Ah yeah, today I learned. BTW, is it still unsafe to modify RSP by hand if I use `-fno-omit-frame-pointer`? – Ammar Faizi Feb 25 '21 at 04:16
  • Oh, no way; I will still take the `uint8_t __attribute__((aligned(64))) buf[0x100];` way instead of modifying RSP by hand. – Ammar Faizi Feb 25 '21 at 04:26
  • Still unsafe. The compiler might choose to just `pop rbp` instead of `leave` if it didn't do any `sub rsp`, among other possible problems. (Including maybe referencing memory allocated with a VLA or alloca.) – Peter Cordes Feb 25 '21 at 04:26
  • Instead of GNU C `__attribute__((aligned(64)))`, just use ISO C++11 `alignas(64)`. (Also in ISO C11 via `<stdalign.h>`, otherwise `_Alignas(64)` without any headers in C.) – Peter Cordes Feb 25 '21 at 04:27
  • @PeterCordes ah okay, looks good to me. – Ammar Faizi Feb 25 '21 at 05:03

1 Answer


The reason the mask register did not affect the result is that I did not encode a mask register with the destination operand of vorps.

In AT&T syntax, the usage is something like:

# Without z-bit (merge-masking)
vorps %zmm0, %zmm1, %zmm2 {%k1}

# With z-bit (zero-masking)
vorps %zmm0, %zmm1, %zmm2 {%k1}{z}

In GCC inline asm, the {} have to be escaped like this:

# Without z-bit
vorps %%zmm0, %%zmm1, %%zmm2 %{%%k1%}

# With z-bit
vorps %%zmm0, %%zmm1, %%zmm2 %{%%k1%}%{z%}

The z bit controls what happens to destination elements whose mask bit is 0: with zero-masking ({z}) they are cleared to zero, while with merge-masking (no {z}) they keep their previous value.
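
In C-like terms, the per-element behavior can be sketched like this (a sketch following the operation description in Intel's manual; the helper function and its name are mine):

#include <stdint.h>
#include <stdbool.h>

// Sketch of masked vorps semantics on a ZMM destination:
// 16 x 32-bit lanes, k1 is a 16-bit mask, 'zeroing' is the {z} bit.
static void vorps_zmm_masked(uint32_t dst[16], const uint32_t src1[16],
                             const uint32_t src2[16], uint16_t k1,
                             bool zeroing)
{
  for (int j = 0; j < 16; j++) {
    if (k1 & (1u << j))
      dst[j] = src1[j] | src2[j];  // mask bit set: write the OR result
    else if (zeroing)
      dst[j] = 0;                  // zero-masking: clear the lane
    // else: merge-masking, dst[j] keeps its previous value
  }
}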

With z-bit

For example, if before vorps operation the value of zmm2 is:

ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 

and zmm0 and zmm1 hold the same values as in the question above (all 0x01 bytes and all 0x02 bytes, respectively).

After these instructions:

    // Set write mask
    "movq   $0b11111111, %%rax\n\t"
    "kmovq  %%rax, %%k1\n\t"

    // Execute vorps, store the result to ZMM2
    "vorps  %%zmm0, %%zmm1, %%zmm2 %{%%k1%}%{z%}\n\t"

    // Plug back the result to memory
    "vmovdqa64  %%zmm2, 0x80(%[buf])\n\t"

Then the result will be (k1 = 0b11111111 selects the low 8 of the 16 dword elements, i.e. the first 32 bytes; the z bit zeroes the rest):

03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 

Without the z bit (merge-masking), the unselected elements keep the old 0xff contents of zmm2, so the result will be:

03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
03 03 03 03 03 03 03 03 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff 

Code example

Godbolt link: https://godbolt.org/z/4rq5M8

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

int main()
{
  alignas(64) uint8_t buf[0x100];
  uint8_t *st_data = buf;

  asm(
    // Fill ZMM2 with 0xff garbage.
    "vpternlogd $0xff, %%zmm2, %%zmm2, %%zmm2\n\t"

    // Fill ZMM0 with 0x01
    "movl   $0x01010101, %%eax\n\t"
    "vpbroadcastd %%eax, %%zmm0\n\t"

    // Fill ZMM1 with 0x02
    "movl   $0x02020202, %%eax\n\t"
    "vpbroadcastd %%eax, %%zmm1\n\t"

    // Plug ZMM0 and ZMM1 value to memory to print later
    "vmovdqa64  %%zmm0, %[buf_0x00]\n\t"
    "vmovdqa64  %%zmm1, %[buf_0x40]\n\t"

    // Set write mask
    "movl   $0b11111111, %%eax\n\t"
    "kmovq  %%rax, %%k1\n\t"

    // vorps without z-bit (merge into ZMM2)
    "vorps  %%zmm0, %%zmm1, %%zmm2 %{%%k1%}\n\t"

    // // vorps with z-bit (zero-mask, overwrite ZMM2)
    // "vorps   %%zmm0, %%zmm1, %%zmm2 %{%%k1%}%{z%}\n\t"

    // Plug the result to memory
    "vmovdqa64  %%zmm2, %[buf_0x80]\n\t"

#ifndef __AVX__
    /*
     * Note:
     * If we pass -mavx, -mavx2, or -mavx512* and clobber AVX
     * register(s) in the inline assembly, the compiler will emit
     * "vzeroupper" itself after the inline assembly.
     *
     * So we only emit vzeroupper ourselves when no AVX flag is
     * set, to avoid a duplicate vzeroupper.
     */
    "vzeroupper"
#endif

    : [buf_0x00]"=m"(*(uint8_t (*)[0x40])(buf + 0x00)),
      [buf_0x40]"=m"(*(uint8_t (*)[0x40])(buf + 0x40)),
      [buf_0x80]"=m"(*(uint8_t (*)[0x40])(buf + 0x80))
      /*
       * Yes, it is all `*(uint8_t (*)[0x40])`, meaning we
       * are going to write 0x40 bytes for each constraint.
       */
    :
    : "rax", "zmm0", "zmm1", "zmm2", "k1"
  );

  static const char *x[] = {
    "Data 1:", "Data 2:", "Result:"
  };

  for (size_t i = 0; i < 3; i++) {
    printf("%s\n", x[i]);
    for (size_t j = 0; j < 8; j++) {
      for (size_t k = 0; k < 8; k++) {
        printf("%02x ", *st_data++);
      }
      printf("\n");
    }
    printf("\n");
  }
  return 0;
}
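
As the comments suggest, intrinsics are usually a better fit than inline asm for this. A minimal sketch using _mm512_mask_or_ps and _mm512_maskz_or_ps might look like this (EVEX vorps is part of AVX512DQ, so compile with -mavx512dq and run on a CPU or emulator that supports it):

#include <stdio.h>
#include <immintrin.h>

// Print 64 bytes as 8 rows of 8, matching the output format above.
static void dump(const char *name, __m512i v)
{
  unsigned char buf[64];
  _mm512_storeu_si512(buf, v);
  printf("%s\n", name);
  for (int i = 0; i < 64; i++)
    printf("%02x%c", buf[i], (i % 8 == 7) ? '\n' : ' ');
  printf("\n");
}

int main(void)
{
  __m512 a   = _mm512_castsi512_ps(_mm512_set1_epi8(0x01));
  __m512 b   = _mm512_castsi512_ps(_mm512_set1_epi8(0x02));
  __m512 old = _mm512_castsi512_ps(_mm512_set1_epi32(-1)); // 0xff "garbage"
  __mmask16 k = 0xff;  // select the low 8 of the 16 dword lanes

  // Zero-masking: unselected lanes become 0.
  __m512 z = _mm512_maskz_or_ps(k, a, b);

  // Merge-masking: unselected lanes keep the value from 'old'.
  __m512 m = _mm512_mask_or_ps(old, k, a, b);

  dump("Zero-masked:",  _mm512_castps_si512(z));
  dump("Merge-masked:", _mm512_castps_si512(m));
  return 0;
}

Note how the merge-masking intrinsic makes the extra input explicit: you must pass the vector to merge into, which is exactly the merge-into-garbage issue the comments below point out.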
  • Not an exact duplicate, but [GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?](https://stackoverflow.com/q/55946103) shows the right syntax, although for Intel syntax (-masm=intel). It includes letting the compiler pick registers instead of hard-coding them. It also points out missing zero-masking, which is a problem your answer shares. (You're using merge-masking, merging into whatever garbage was previously in ZMM2.) Using intrinsics would make that problem obvious, because the merge-masking intrinsic takes an extra input (vector to merge into). – Peter Cordes Feb 25 '21 at 04:23
  • Just FYI, there are much cheaper ways to implement `zmm0 = _mm512_set1_epi8(0x01)`, e.g. `mov $0x01010101, %%eax` / `vpbroadcastd %%eax, %%zmm0`. And `zmm2 = set1(-1)` (all ones) can be done with `"vpternlogd $0xff, %%zmm2, %%zmm2, %%zmm2"`. Or if you are going to load from memory, use `vpbroadcastd (%[buf]), %%zmm0` or a broadcast memory source operand like `vorps (%rdi){1to16}, %zmm1, %zmm2 {%k1}{z}` (extra % escaping omitted for readability). Also, for integer data like this it would be more idiomatic to use `vpord` (same masking granularity as `vorps`). – Peter Cordes Feb 25 '21 at 11:02
  • And BTW, using `(%[buf])` defeats part of the benefit of using a memory operand: the compiler needs the exact address in a register for your `"r"` constraint; it can't pick an addressing mode like `64(%rsp)` for you. Use `%0` (or the name for the `"=m"` constraint) to let the compiler expand it that way. You can do `64 + %0` and maybe get a warning from the assembler about something like `64 + (%rdi)` instead of `64 + 64(%rsp)`, but it still works. Of course, using intrinsics leaves this up to the compiler entirely, so normally you should do that. – Peter Cordes Feb 25 '21 at 11:07
  • @PeterCordes I really love your advice; for this case I think we can use `[buf_0x00]"=m"(*(uint8_t (*)[0x40])(buf + 0x00))` / `[buf_0x40]"=m"(*(uint8_t (*)[0x40])(buf + 0x40))` / `[buf_0x80]"=m"(*(uint8_t (*)[0x40])(buf + 0x80))`, so the compiler will directly take the RSP for those constraints. – Ammar Faizi Feb 25 '21 at 13:38
  • Yep, that looks right (including the output memory operand casts: deref a pointer-to-array to tell GCC exactly which memory you write). Nice example. – Peter Cordes Feb 25 '21 at 13:41
  • Fun fact: an even more compact but less "obvious" way to set up the `0x01...` and `0x02...` constants would be `vpabsb %%zmm2, %%zmm0` (1 = abs(-1)), and `vpaddb %%zmm0, %%zmm0, %%zmm1` (2 = 1+1). Also, `kxnorb %%k0, %%k0, %%k1` sets up `k1 = 0x00ff`. [What are the best instruction sequences to generate vector constants on the fly?](https://stackoverflow.com/q/35085059). This is one part where you can actually beat the compiler vs. using intrinsics. But for the purposes of this answer, keeping the constant setup simple and obvious avoids distraction from the masking. – Peter Cordes Feb 25 '21 at 13:49
  • Yup, those all worked :) https://godbolt.org/z/Gb6r1W. GCC would probably have done something stupid like load 64 bytes from memory for `_mm512_set1_epi8(1)`, even after using `vpternlogd` for `_mm512_set1_epi8(-1)`. Clang sometimes sets up broadcast loads from narrower constants, but usually still insists on constant-propagation to create more crap it needs to load. – Peter Cordes Feb 25 '21 at 13:54
  • There's no reason to use `movl` instead of `mov`; the EAX destination implies 32-bit operand-size. I usually only bother with a size suffix when it's needed (like `op $imm, mem`), since it doesn't particularly help readability. Good update about vzeroupper, though; the old version would have stepped on zmm3..15 if the compiler was using them for anything, without declaring clobbers. Since you're using 512-bit vectors anyway (so there's no code-size advantage to using smaller VEX-coded instructions), you could just use ZMM16, ZMM17, and ZMM18, and never need `vzeroupper`. – Peter Cordes Feb 26 '21 at 04:12
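
For completeness, the m32bcst form from the documentation quoted in the question (and the broadcast suggestion in the comments above) can be combined with masking too. A sketch, where the `src` variable and its value are just an arbitrary example, and where `{1to16}` needs the same `%{...%}` escaping as the mask inside a GCC asm template:

#include <stdio.h>
#include <stdint.h>
#include <stdalign.h>

int main(void)
{
  alignas(64) uint8_t out[0x40];
  uint32_t src = 0x04040404;  // arbitrary dword to broadcast

  asm(
    // Fill ZMM1 with 0x02 bytes and set k1 = 0xff.
    "movl         $0x02020202, %%eax\n\t"
    "vpbroadcastd %%eax, %%zmm1\n\t"
    "movl         $0xff, %%eax\n\t"
    "kmovw        %%eax, %%k1\n\t"

    // m32bcst form: broadcast the dword at (%[src]) to all 16 lanes,
    // OR with ZMM1, and zero-mask with k1.
    "vorps        (%[src])%{1to16%}, %%zmm1, %%zmm2 %{%%k1%}%{z%}\n\t"

    "vmovdqa64    %%zmm2, %[out]\n\t"
    "vzeroupper"  // see the vzeroupper note in the answer above
    : [out]"=m"(out)
    : [src]"r"(&src)
    : "rax", "zmm1", "zmm2", "k1"
  );

  // Expect 06 (0x04 | 0x02) in the first 32 bytes, 00 in the rest.
  for (int i = 0; i < 0x40; i++)
    printf("%02x%c", out[i], (i % 8 == 7) ? '\n' : ' ');
  return 0;
}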