
I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the cleanup loop after the main loop could be removed using the AVX512 mask operations.

Here is the code I am using

#include <immintrin.h>

void daxpy2(int n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  int r = n&7, n2 = n - r;
  for(int i=-n2; i<0; i+=8) {
    __m512d yv = _mm512_loadu_pd(&y[i+n2]);
    __m512d xv = _mm512_loadu_pd(&x[i+n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(&y[i+n2], yv);
  }
  __m512d yv = _mm512_loadu_pd(&y[n2]);
  __m512d xv = _mm512_loadu_pd(&x[n2]);
  yv = _mm512_fmadd_pd(av, xv, yv);
  __mmask8 mask = (1 << r) -1;
  //__mmask8 mask = _bextr_u32(-1, 0, r);
  _mm512_mask_storeu_pd(&y[n2], mask, yv);
}

I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,

__mmask8 mask = _bextr_u32(-1, 0, r)

is no better (in number of instructions) than

__mmask8 mask = (1 << r) -1;

see https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_.

This appears to be because _bextr_u32 packs the start and the length into a single control operand (length in bits 15:8), so the compiler has to shift r left by 8 anyway.
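As an illustration, here is a minimal sketch of what the intrinsic maps to (the helper name and the decomposition are mine, following the documented BEXTR operation, and it assumes start and len are both below 32):

#include <stdint.h>

// Sketch of what _bextr_u32(src, start, len) has to do: BEXTR takes ONE
// control register, bits 7:0 = start and bits 15:8 = length, so a run-time
// length r still costs a shift by 8 to build that operand.
static inline uint32_t bextr_u32_model(uint32_t src, uint32_t start, uint32_t len) {
  uint32_t ctrl = (start & 0xff) | ((len & 0xff) << 8);  // the "shl eax, 8"
  uint32_t s = ctrl & 0xff, l = ctrl >> 8;               // what BEXTR decodes
  return (src >> s) & ((1u << l) - 1u);                  // extract l bits at s
}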

Can the mask be generated with fewer instructions (e.g. with BMI or another method) or more optimally?


I have augmented the table in the link with my results for AVX512.

ISA                           | MIPS-32 | AVX2   | RV32V | AVX512 |
******************************|*********|********|*******|********|
Instructions (static)         |      22 |     29 |    13 |     28 |
Instructions per Main Loop    |       7 |     6* |    10 |     5* |
Bookkeeping Instructions      |      15 |     23 |     3 |     23 |
Results per Main Loop         |       2 |      4 |    64 |      8 |
Instructions (dynamic n=1000) |    3511 | 1517** |   163 |    645 |

*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.

I think if the authors of the link had counted from -n up to 0 instead of from 0 to n, they could have skipped the cmp instruction in the main loop as I have (see the assembly below), so for AVX it could have been 5 instructions in the main loop.
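As an illustration (this is my own sketch of the trick, not the authors' code, and it assumes n2 has already been rounded down to a multiple of 4), counting the index up from -n2 lets the add set the flags the branch tests, and the compiler can fold one load into the FMA's memory operand, giving load, fma, store, add, branch in the main loop with no separate cmp:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical AVX2 main loop counting from -n2 up to 0 (n2 a multiple of 4).
static void daxpy_avx2_main(int64_t n2, double a, const double x[], double y[]) {
  __m256d av = _mm256_set1_pd(a);
  for (int64_t i = -n2; i < 0; i += 4) {         // add sets flags, no cmp needed
    __m256d yv = _mm256_loadu_pd(&y[i + n2]);
    __m256d xv = _mm256_loadu_pd(&x[i + n2]);
    _mm256_storeu_pd(&y[i + n2], _mm256_fmadd_pd(av, xv, yv));
  }
}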

Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512

daxpy2(int, double, double const*, double*):
    mov       eax, edi                                      #6.13
    and       eax, 7                                        #6.13
    movsxd    r9, edi                                       #6.25
    sub       r9, rax                                       #6.21
    mov       ecx, r9d                                      #7.14
    neg       ecx                                           #7.14
    movsxd    rcx, ecx                                      #7.14
    vbroadcastsd zmm16, xmm0                                #5.16
    lea       rdi, QWORD PTR [rsi+r9*8]                     #9.35
    lea       r8, QWORD PTR [rdx+r9*8]                      #8.35
    test      rcx, rcx                                      #7.20
    jge       ..B1.5        # Prob 36%                      #7.20
..B1.3:                         # Preds ..B1.1 ..B1.3
    vmovups   zmm17, ZMMWORD PTR [rdi+rcx*8]                #10.10
    vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8]        #10.10
    vmovups   ZMMWORD PTR [r8+rcx*8], zmm17                 #11.23
    add       rcx, 8                                        #7.23
    js        ..B1.3        # Prob 82%                      #7.20
..B1.5:                         # Preds ..B1.3 ..B1.1
    vmovups   zmm17, ZMMWORD PTR [rsi+r9*8]                 #15.8
    vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8]        #15.8
    mov       edx, -1                                       #17.19
    shl       eax, 8                                        #17.19
    bextr     eax, edx, eax                                 #17.19
    kmovw     k1, eax                                       #18.3
    vmovupd   ZMMWORD PTR [r8]{k1}, zmm16                   #18.3
    vzeroupper                                              #19.1
    ret                                                     #19.1

where

    add       rcx, 8
    js        ..B1.3

should macro-op fuse into a single uop. However, as pointed out by Peter Cordes in this answer, js cannot fuse. The compiler could have generated jl instead, which would have fused.


I used Agner Fog's testp utility to get the core clocks (not reference clocks), instructions, and uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512 for three different variations of looping:

v1 = for(int64_t i=0;   i<n;  i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0;  i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne

vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512

vec_size version   core cycles   instructions   uops
2        v1        895           3014           3524
2        v2        900           2518           3535
2        v3        870           2518           3035
4        v1        527           1513           1777
4        v2        520           1270           1777
4        v3        517           1270           1541
8        v1        285            765            910
8        v2        285            645            910
8        v3        285            645            790

Notice that the core clocks are not really a function of the loop version; they depend only on the number of loop iterations and are proportional to 2*n/vec_size:

SSE     2*1000/2=1000
AVX2    2*1000/4=500
AVX512  2*1000/8=250

The number of instructions does change from v1 to v2 but not between v2 and v3. For v1 it is proportional to 6*n/vec_size and for v2 and v3 to 5*n/vec_size.

Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size and for v3 to 6*n/vec_size.
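As a quick check against the table above: for n=1000 and vec_size=8, v3 should need roughly 5*1000/8 ≈ 625 instructions and 6*1000/8 = 750 uops, against 645 and 790 measured, while v1/v2 should need roughly 7*1000/8 ≈ 875 uops against the measured 910; the difference is the bookkeeping and tail code outside the loop.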


Here is the result with IACA3 for vec_size=2

Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.5  |  1.5     1.0  |  1.5     1.0  |  1.0  |  0.0  |  0.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vmovupd xmm1, xmmword ptr [r8+rax*8]
|   2      | 0.5         | 0.5  | 0.5     0.5 | 0.5     0.5 |      |      |      |      | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
|   2      |             |      | 0.5         | 0.5         | 1.0  |      |      |      | vmovups xmmword ptr [rcx+rax*8], xmm1
|   1*     |             |      |             |             |      |      |      |      | add rax, 0x2
|   0*F    |             |      |             |             |      |      |      |      | js 0xffffffffffffffe3
Total Num Of Uops: 6

IACA claims that js macro-fuses with add, which disagrees with Agner Fog and with the performance counters from the testp utility. See above: v2 is proportional to 7*n/vec_size and v3 to 6*n/vec_size, which I infer to mean that js does not macro-fuse.

I think the authors of the link, in addition to the number of instructions, should also have considered core cycles and maybe uops.

Z boson
  • That link is a good summary of the problems with short-vector SIMD. Agner Fog's proposal for a new clean-slate ISA ([ForwardCom](https://forwardcom.info/)) also tries to address that, giving a different way for existing binaries to use wider vectors on future CPUs. But anyway, one part of their AVX prologue code is laughably bad. They use 3 instructions to get `a` broadcast to 256 bits, instead of using `vbroadcastsd ymm1, [stack arg]` and using the same register for both vector and scalar. I wonder if they originally wrote the code for x86-64 with a register arg and AVX1 (no reg-reg bcast)? – Peter Cordes Feb 21 '19 at 17:53
  • Their inner loop also needs to be unrolled and avoid an indexed addressing mode for the FMA memory source, to run at 1 iteration per clock on HSW/SKL. So that's further bloat for efficient use of short-vector SIMD. A vector ISA is obviously great for stuff like DAXPY, but I wonder how well it does with higher computational intensity (where you can do multiple things with data while in registers), or when you need to shuffle e.g. to implement prefix sums or a video encoder's SAD motion search. Or [Fastest way to get IPv4 address from string](//stackoverflow.com/q/31679341) pmovmskb + pshufb – Peter Cordes Feb 21 '19 at 18:00
  • `add`/`sub`/`cmp` can't macro-fuse with `js` [x86\_64 - Assembly - loop conditions and out of order](//stackoverflow.com/a/31778403). Use `jl` or `jle` instead. [Micro fusion and addressing modes](//stackoverflow.com/q/26046634) means your best bet is to increment a pointer as the FMA operand so it can stay micro-fused, and address the `vmovups` load/store relative to that with a pointer-difference. (Because SKX can micro-fuse an indexed addressing mode for a store.) None of this matters on KNL, though: I think there's no unlamination, and no macro-fusion. – Peter Cordes Feb 21 '19 at 18:07
  • `bextr` is 2 uops on Intel CPUs. It's pretty garbage most of the time, unfortunately, especially not being available in immediate form. `pdep` is 1 uop on Intel, *many* on AMD, so for constant bitfield extracts it's actually more efficient (but higher latency: 3 cycles). – Peter Cordes Feb 21 '19 at 18:09
  • What about using the `kmovb` instruction to read the mask from a small lookup table? The LUT size is only 8 entries of 8 bits. [`kmovb`](https://hjlebbink.github.io/x86doc/html/KMOVW_KMOVB_KMOVQ_KMOVD.html) can read directly from memory. – wim Feb 21 '19 at 18:41
  • @wim: But do you really want a LUT at all? It can cache miss, so I don't see that being better than a xor-zero / `bts` / `dec` to create `(1 << r) - 1` in a register. – Peter Cordes Feb 21 '19 at 18:49
  • 1
    @PeterCordes: At least a LUT is less instructions (requested by Z boson) :). In the use case where `daxpy2` is called frequently with small `n`, cache misses will only have a negligible impact on the performance. Nevertheless, in this case there is already quite some pressure on the load ports. So, a computed mask, instead of a LUT load, might be preferred here, I admit. – wim Feb 21 '19 at 20:08
  • 1
    @wim: Yeah it's a bit of a tradeoff between getting the first vector load into the back-end a cycle earlier. But `kmov k,[mem]` is 3 uops on SKX vs. `kmov k,r` being 1, presumably going through an integer load and then an ALU int->FPU transfer. If you actually care about performance instead of instruction count, you might want `xor`-zero / `bts` before the loop (2 front-end uops, and gives port0 or 6 something to do), and maybe the dec/kmov after. Assuming there's a branch miss on the loop exit, the back end will have caught up some with the front-end, but we still have time for some work... – Peter Cordes Feb 21 '19 at 20:16
  • 2
    Unless `n` is a multiple of 8, your `for` loop will load and store past `x+n`/`y+n` (just try `n=1` ...) – chtz Feb 21 '19 at 21:11
  • 1
    @PeterCordes: 3 uops is more than I expected. Note that icc does not generate a `kmov k,[mem]`, but an integer load followed by a `kmov k,r`, which is more efficient? – wim Feb 21 '19 at 22:31
  • 1
    @wim: yeah, if Agner Fog's uop tables are right, a `mov` or `movzx` integer load and then `kmov k,r` will be fewer uops, but larger code-size. But one multi-uop instruction can be a downside for uop-cache packing... On average ICC's choice is probably the right one, 2 separate 1-uop instructions looks good to me. – Peter Cordes Feb 22 '19 at 04:34
  • @chtz, thank you for pointing out that bug! The loop originally went from `[0,n2)`; the new range should have been `[-n2, 0)`. I fixed the code now and tested it (see the edit). Fortunately, it does not change the structure of the code (e.g. number of instructions) or any of my other points. – Z boson Feb 22 '19 at 08:15
  • @chtz, [here](https://godbolt.org/z/ok-1tn) is the result going from `[0,n2)`. It's only 24 instructions instead of 28. But it has the `cmp` instruction in the loop. I decided that four extra "bookkeeping" instructions were better than the `cmp` in the loop. – Z boson Feb 22 '19 at 08:24
  • @PeterCordes, getting the compiler to not generate the `cmp` instruction is not easy. I managed (after asking [a question](https://stackoverflow.com/q/25921612/2542702)) to get GCC to do this in other example but [not in this case](https://godbolt.org/z/HJb-HY). – Z boson Feb 22 '19 at 08:28
  • @PeterCordes, are you sure that this loop cannot run at one iteration per cycle? This code is very similar to another [question I asked](https://stackoverflow.com/q/25899395/2542702). In that case, however, it needed to generate three addresses and use port 7. In this case only two addresses are needed (for `x` and `y`). So only two AGUs are needed instead of three. – Z boson Feb 22 '19 at 08:34
  • @PeterCordes, in any case, I specifically avoided any mention of cycles. I am not sure why the authors of that link said SIMD was harmful. I think they were referring to energy usage and implying that energy usage is proportional to number of instructions. That seems like a very fuzzy argument to me. – Z boson Feb 22 '19 at 08:38
  • 1
    @Zboson: oh right, even if the FMA didn't un-laminate into 2 uops (making 5 total), you'd bottleneck in the back-end on p23 because both loads and the store-address uop need the AGUs (and can't use p7 because of the indexed store). Even though it's the same address twice, they're separate uops in separate instructions, so there's no chance for them to reuse one AGU result. – Peter Cordes Feb 22 '19 at 08:41
  • Their main argument is that it's not future-proof: Code compiled (or source hand-vectorized) for SSE 10 or 15 years ago can't take advantage of AVX or AVX512. And extensions to wider vectors have already given us 3 versions of most SIMD instructions in the ISA already. EVEX prefixes are 4 byte long! Short-vector SIMD has obvious major downsides, but it's not easily replaceable for stuff other than simply looping over arrays. – Peter Cordes Feb 22 '19 at 08:45
  • @PeterCordes, what do you think of [this solution](https://godbolt.org/z/HTt43h) which generates the mask inside the loop. If the loop cannot run once per clock cycle anyway maybe it's worth generating the mask in the loop? This was actually my other goal I wanted to test with masks. And thanks to wim's answer I can do it now. – Z boson Feb 22 '19 at 08:55
  • 1
    That's horrible unless you're optimizing for code-size / think the loop is probably not hot but want to auto-vectorize compactly just in case. 1 iter per 1.5 cycles is *much* better than 1 iter per 3 cycles saturating the front-end with 12 uops!! (More uops in the loop means fewer iterations fit in the ROB for lookahead / generating demand loads for future cache lines that HW prefetch didn't get, e.g. across a 4k boundary.) – Peter Cordes Feb 22 '19 at 09:04
  • @PeterCordes, I made a long update to my question. I think you will find it interesting. IACA3 (and IACA2 - I checked them both as IACA3 only starts with Haswell now) says that `js` macro-fuses with `add` but the performance counters don't show this. – Z boson Mar 04 '19 at 10:25

2 Answers


You can save one instruction if you use the following BMI2 intrinsic:

  __mmask8 mask = _bzhi_u32(-1, r);

instead of __mmask8 mask = (1 << r) -1;. See Godbolt link.

The bzhi instruction zeros the high bits starting at a specified position. With register operands, bzhi has a latency of 1 cycle and a throughput of 2 per cycle.
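For reference, here is a small sketch of what _bzhi_u32(src, idx) computes (my own model of the documented BZHI operation; the helper name is mine):

#include <stdint.h>

// Conceptual model of BZHI: zero the bits of src from position idx upward,
// so with src = -1 it yields the low-idx-bits mask in a single instruction.
// Only SRC2[7:0] is used; for idx >= 32 the source is returned unchanged
// (the real instruction also sets CF in that case).
static inline uint32_t bzhi_u32_model(uint32_t src, uint32_t idx) {
  idx &= 0xff;
  return idx >= 32 ? src : src & ((1u << idx) - 1u);
}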

wim
  • This solved another problem. I wanted to generate the mask inside the loop. Then no cleanup is necessary after the loop. See https://godbolt.org/z/HTt43h. The main problem was that `(1 << r)-1` is limited to `r%32` and `_bextr_u32(-1, 0, r)` to `r%256`, but `_bzhi_u32` does not seem to have this problem. The loop is long now though. – Z boson Feb 22 '19 at 08:53
  • GCC solution generating mask in the loop is also short https://godbolt.org/z/vrGWwW – Z boson Feb 22 '19 at 08:59
  • 1
    Note that with `bzhi` also only the 8 least significant bits of the position are used according to Intel's Software Developer’s Manual: the operation starts with `N ← SRC2[7:0]`. – wim Feb 22 '19 at 09:18
  • 3
    I would further suggest to use a 64bit counter, and replace the loop-condition by `i!=0`. In case you knew that `n>=8`, you could actually just load and store some overlapping memory without having to generate a mask: https://godbolt.org/z/UV_hYy – chtz Feb 22 '19 at 09:24
  • @chtz, wooaaah! That's a great suggestion! You should consider making that an answer. That also fixes the `js` problem which does not fuse. – Z boson Feb 22 '19 at 09:36
  • @chtz, that's a clever trick using `size_t`; since it's unsigned I would not have thought to use `size_t i=-n2`. – Z boson Feb 22 '19 at 09:41
  • @chtz, it even gets GCC to do the right thing and not use `cmp` https://godbolt.org/z/jeigFw – Z boson Feb 22 '19 at 09:42
  • @chtz, apparently `int` and `n2!=0` is sufficient to remove `cmp` and generate `jne` for ICC, but GCC needs `size_t` to remove the `cmp`. – Z boson Feb 22 '19 at 10:57

In addition to @wim's answer of using _bzhi_u32 instead of _bextr_u32, you should:

  • Mask the _mm512_loadu_pd instructions at the end, to avoid loading invalid memory (https://stackoverflow.com/a/54530225) or doing arithmetic on non-finite values.
  • Use 64bit integers everywhere (actually either signed or unsigned) to avoid the movsxd sign-extension. This is generally good advice on 64bit systems, unless you need to store a lot of index variables.
  • Use i!=0 instead of i<0 as loop condition to get a jne instead of js, since this better pairs with the add instruction: https://stackoverflow.com/a/31778403
  • Some minor things: instead of n2 = n - r, you could also calculate n2 = n & (-8) or n2 = n ^ r. Not sure if that makes a relevant difference (icc does not seem to know or care about that). Godbolt-Link

#include <immintrin.h>
#include <stddef.h>

void daxpy2(size_t n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  size_t r = n&7, n2 = n & (-8);
  for(size_t i=-n2; i!=0; i+=8) {
    __m512d yv = _mm512_loadu_pd(&y[i+n2]);
    __m512d xv = _mm512_loadu_pd(&x[i+n2]);
    yv = _mm512_fmadd_pd(av, xv, yv);
    _mm512_storeu_pd(&y[i+n2], yv);
  }
  __mmask8 mask = _bzhi_u32(-1, r);
  __m512d yv = _mm512_mask_loadu_pd(_mm512_undefined_pd (), mask, &y[n2]);
  __m512d xv = _mm512_mask_loadu_pd(_mm512_undefined_pd (), mask, &x[n2]);
  yv = _mm512_mask_fmadd_pd(av, mask, xv, yv);
  _mm512_mask_storeu_pd(&y[n2], mask, yv);
}
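A minimal usage sketch (my own test harness, not part of the answer), assuming the function above is compiled for an AVX512F-capable machine (e.g. -xCOMMON-AVX512 with ICC or -mavx512f with GCC/Clang); n = 10 exercises one full 8-wide iteration plus a 2-element masked tail:

#include <stdio.h>

int main(void) {
  double x[10], y[10];
  for (int i = 0; i < 10; ++i) { x[i] = i; y[i] = 1.0; }
  daxpy2(10, 2.0, x, y);            // y[i] = 2*x[i] + y[i]
  for (int i = 0; i < 10; ++i)
    printf("%g ", y[i]);            // expect 1 3 5 7 9 11 13 15 17 19
  printf("\n");
  return 0;
}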

To further reduce the number of instructions, you can use pointer incrementation, e.g., like this (however, this increases the number of instructions inside the loop).
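A sketch of what that could look like (my guess at the shape of the linked code, not a copy of it; it assumes AVX512F + BMI2 and reuses the masked tail from above). The pointers themselves are advanced, so the indexed addressing mode disappears, at the cost of extra increment/compare work inside the loop:

#include <immintrin.h>
#include <stddef.h>

void daxpy2_ptr(size_t n, double a, const double x[], double y[]) {
  __m512d av = _mm512_set1_pd(a);
  const double *xend = x + (n & ~(size_t)7);   // end of the full 8-wide blocks
  for (; x != xend; x += 8, y += 8) {          // pointer increments + jne
    __m512d yv = _mm512_loadu_pd(y);
    _mm512_storeu_pd(y, _mm512_fmadd_pd(av, _mm512_loadu_pd(x), yv));
  }
  __mmask8 mask = _bzhi_u32(-1, n & 7);        // tail mask, 0 if n%8 == 0
  __m512d yv = _mm512_maskz_loadu_pd(mask, y);
  __m512d xv = _mm512_maskz_loadu_pd(mask, x);
  _mm512_mask_storeu_pd(y, mask, _mm512_fmadd_pd(av, xv, yv));
}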

chtz