
A previous question established that an AVX-512 masked load won't cause a page fault on unmapped bytes, as long as the mask bits for those bytes are zero.

Does the same apply to store-forwarding failures? If a masked load comes immediately after a store, but the store overlaps only bytes of the load whose mask bits are zero, will we see a store-forwarding penalty?
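To make the scenario concrete, here is a rough sketch of how it could be probed. Everything in it is my own assumption rather than a known-good benchmark: the names `probe_masked_reload` and `probe_kernel` are made up, the choice of a 4-byte store is arbitrary, it relies on GCC/Clang extensions on x86-64, and `__rdtsc` counts reference ticks rather than core clock cycles, so only relative comparisons between runs mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#include <x86intrin.h>

/* Hypothetical kernel: a 4-byte store followed by a 64-byte masked reload.
   The value extracted from the load feeds the next store, so a load that
   has to wait for the store puts the pair on one serial dependency chain. */
__attribute__((target("avx512f,avx512bw")))
static double probe_kernel(uint64_t mask, size_t store_off, long iters)
{
    _Alignas(64) unsigned char buf[64];
    uint32_t x = 1;
    assert(store_off + sizeof x <= sizeof buf);
    memset(buf, 0, sizeof buf);

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        memcpy(buf + store_off, &x, sizeof x);       /* the 4-byte store */
        /* Compiler barrier only: keeps the store and the reload in order
           in the generated code. It emits no instruction. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        __m512i v = _mm512_maskz_loadu_epi8((__mmask64)mask, buf);
        x = (uint32_t)_mm_cvtsi128_si32(_mm512_castsi512_si128(v)) + 1;
    }
    unsigned long long t1 = __rdtsc();

    volatile uint32_t sink = x;                      /* keep x live */
    (void)sink;
    return (double)(t1 - t0) / (double)iters;
}
#endif

/* Returns TSC ticks per iteration, or -1.0 if the arguments are out of
   range or AVX-512BW is not available on this build or CPU. */
double probe_masked_reload(uint64_t mask, size_t store_off, long iters)
{
    if (store_off > 60 || iters <= 0)
        return -1.0;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw"))
        return probe_kernel(mask, store_off, iters);
#endif
    (void)mask;
    return -1.0;
}
```

The comparison I have in mind is `probe_masked_reload(0x00000000FFFFFFFFull, 60, iters)`, where the store touches only masked-out bytes, against `probe_masked_reload(~0ull, 60, iters)`, where the same store overlaps selected bytes. If the two report similar ticks per iteration, that would hint that the mask does not help; I have not verified what either case actually does.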

Under what circumstances does an AVX-512 masked load trigger a store-forwarding failure? Is it any different from a non-masked load of the same size?

I'm interested in Skylake-X in particular, but would appreciate any pointers!

Peter Cordes
Elliot Gorokhovsky
    IIRC, Intel's optimization guide a couple years ago had a very small amount to say about masked store-forwarding, perhaps whether a masked store could forward to a vector reload if the mask was all-ones, or whether a vector load avoids a false dependency from masked elements. More detail will require some experiments, or data from experiments someone's already done. e.g. for stuff like store and reload with the same mask, if that can store-forward without waiting for a cache miss. (Testable by scattering stores around and checking lat vs. tput bounds, if load exec doesn't parallel fetch...) – Peter Cordes Oct 02 '22 at 23:56

0 Answers