
Can masking improve the performance of AVX-512 memory operations (load/store/gather/scatter and non-shuffling load-ops)?

Since masked-out elements don't trigger memory faults, one would assume that masking helps performance in those cases. However, what about the following, if a mask of 0 were used:

  • a load/store which crosses a cacheline boundary - would this suppress the cacheline cross penalty?
    • and suppress a load from L2 cache (or further away) if either or both cachelines aren't in L1?
  • does a masked-out load affect memory reordering?
  • gather/scatter throughput seems to be limited by the CPU's load-store unit, but would masking off elements lessen this impact?

This is in the context of current Intel processors at the moment, but it would be interesting to see how an AVX-512-enabled AMD processor handles this.

zinga
  • Good question, I've wondered this myself. I doubt that masking could make a split load as fast as a non-split load; it's probably processed in parallel, not checking the mask before address-generation and checking based on size. Especially for a 4k-split. But it's certainly plausible that a request to L2 doesn't happen. – Peter Cordes Aug 10 '22 at 21:49
  • AVX1/2 `vmaskmovps` on Skylake probably uses a similar implementation, and fault-suppression crossing into an unmapped page is slow. So is not writing to a read-only page (which can happen because of copy-on-write tricks by the OS): very slow, with a microcode assist. ([SSE: does mask store affect the bytes that were masked out](https://stackoverflow.com/a/60372223) / [What does MaskStore do behind the scenes?](https://stackoverflow.com/a/72343459)). Masked stores are currently slow on AMD, so I'm curious how Zen4 implements that part of AVX-512. – Peter Cordes Aug 10 '22 at 21:57
  • Correction, I'm not sure *crossing into* an unmapped page is slow on Skylake with AVX `vmaskmovps` (some lanes valid, some invalid); what I remember from my test results is the all-zero-mask case being slow on a non-writeable page, so potentially bad for conditional update of an array if no replacements get done. (Also TODO: test on a writeable but clean page, to see if it leaves it clean and thus would have to take an assist every time to update the page-table bit.) IIRC, there's some mention of some of this in Intel's optimization manual, also re: store-forwarding. – Peter Cordes Aug 11 '22 at 02:02

1 Answer


I tried running some tests on an AVX-512-enabled Intel 12700K. I haven't done this before, so I wouldn't be surprised if I messed something up.

I'm not sure how to reliably test L2 behaviour or reordering, but for the rest I took nanoBench and ran this script, yielding these results (in CSV form).

Instructions tested:

  • Load
    • VMOVDQU8/64
    • VPADDB/Q (load-op)
    • VPEXPANDB/Q
    • VPMOVZXBD
  • Store
    • VMOVDQU8/64
    • VPCOMPRESSB/Q
    • VPMOVQW
  • VPGATHERDD & VPSCATTERDD

I can't see any difference based on mask value (0 or -1 tested) for loads; however, there may be a slight difference for stores. I'm not entirely sure what CORE_CYCLES means, but it's one cycle less for the 0 mask compared to a -1 mask.
This behaviour seems consistent across the store instructions tested, with the load+store test of VMOVDQU64 being the odd exception (a difference of ~5 cycles). I'm not sure why, but the result is repeatable. Cacheline crossing doesn't appear to be the reason behind the difference either: testing masks such as 1, 2 and 128 seems to indicate that the lower CORE_CYCLES can only be achieved with a 0 mask.

Gather/scatter is giving me identical results regardless of the mask or the number of cachelines the instruction would hit.

I think it's fair to assume that the mask value generally doesn't affect masked memory access performance (beyond perhaps suppressing faults). Maybe it has a minor impact on stores, but I'm unclear on this, and it could be microarchitecture-dependent.

zinga
  • Thanks for doing some testing. This answer would be a lot more useful and self-contained if it described what nanoBench was testing, like whether your "always the same" was just for aligned 64-byte accesses, or whether this also tested cache-line splits and page-splits. And whether these tests are throughput and/or latency. (And what the numbers actually were for the things being tested, like 1/clock 64-byte store throughput?) There's one mention of testing cache-line crossing for `vmovdqu64`, but it's not clear whether you tested that in general. – Peter Cordes Aug 18 '22 at 20:35
  • Good questions, to which I sadly don't have the answers. It answers what I wanted to know in the question, but more detailed analysis is certainly welcome if anyone wishes to provide it. – zinga Aug 18 '22 at 23:36
  • Some of the improvements I suggested would just be a matter of including info from the script you used in the question itself, instead of just a link to a gist. And including some of the actual numbers from your testing, so we can see what the aligned vs. unaligned penalty was, not just your summary that "it's the same" with or without masking. Seeing the numbers is useful to better understand what was being tested, given existing knowledge about CPU performance. – Peter Cordes Aug 19 '22 at 02:28
  • Appreciate the suggestions and explanation, but analysis isn't my forte, and I don't know what one might find to be relevant or not, hence my inclusion of everything so the reader can decide. Feel free to add clarifications as you see fit, and/or I can delete this answer if you find that to be more appropriate. If you need more info from this hardware and have a complete script, I can run it and post its output to help you get any info you need. – zinga Aug 19 '22 at 02:59
  • That's the thing; you didn't include everything *in your answer*, mostly just some off-site links that could rot. Only the briefest summary is present in your actual answer, so it would be much better to include some representative numbers, e.g. from one load and one store case if the other cases are similar. – Peter Cordes Aug 19 '22 at 03:15