Can masking improve the performance of AVX-512 memory operations (load/store/gather/scatter and non-shuffling load-ops)?
Since masked-out elements don't trigger memory faults, one would assume that masking helps performance in those cases. But what about the following, if a mask of 0 were used:
- a load/store which crosses a cacheline boundary: would the mask suppress the cacheline-crossing penalty?
- would it also suppress a load from L2 cache (or further away) if either or both cachelines aren't in L1?
- does a masked-out load affect memory reordering?
- gather/scatter throughput seems to be limited by the CPU's load-store unit, but would masking off elements lessen that limit?
I'm asking in the context of current Intel processors, but it would also be interesting to see how an AVX-512 enabled AMD processor handles this.
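
To make the cases above concrete, here is a minimal sketch of how the test inputs could be classified. It assumes 64-byte cachelines and 32-bit elements (16 per 512-bit vector), and `lines_touched` is a hypothetical helper name of my own. It only models which cachelines the active elements cover architecturally; whether the hardware actually skips the other line is exactly what is being asked.

```c
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cacheline size in bytes */
#define ELEM_SIZE   4u  /* assumed 32-bit elements, 16 per 512-bit vector */

/* Hypothetical helper: number of distinct cachelines covered by the
   active (mask bit = 1) elements of a 512-bit access starting at addr.
   A 64-byte access spans at most two lines, so two flags are enough. */
static unsigned lines_touched(uintptr_t addr, uint16_t mask)
{
    uintptr_t base_line = addr / CACHE_LINE;
    unsigned touched[2] = {0, 0};

    for (unsigned i = 0; i < 16; i++) {
        if (!((mask >> i) & 1u))
            continue;  /* masked-out element: no architectural access */
        uintptr_t first_byte = addr + (uintptr_t)i * ELEM_SIZE;
        uintptr_t last_byte  = first_byte + ELEM_SIZE - 1;
        touched[first_byte / CACHE_LINE - base_line] = 1;
        touched[last_byte  / CACHE_LINE - base_line] = 1;
    }
    return touched[0] + touched[1];
}
```

For example, with `addr = 0x1020` a full mask (`0xFFFF`) covers two lines, `0x00FF` or `0xFF00` covers one, and `0x0000` covers none. A benchmark comparing those masks on the same address (for instance with `_mm512_maskz_loadu_epi32`) would show whether the line-crossing penalty follows the mask or the address range.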