The Intel Intrinsics Guide lists several intrinsics that store only part of a wide register, such as _mm_maskstore, _mm_mask_store and _mm_mask_compressstoreu.

Is it OK to use them when my thread doesn't own part of the cache line the full-width store would cover, or when the masked-out part extends past the end of the current page?

Example:

#include <atomic>
#include <cstdint>

struct S {
  std::int16_t write_here[10];
  std::atomic<std::int16_t> other_thread_can_use_this;
};
Can I write to write_here with a single SIMD store? Or could that corrupt other_thread_can_use_this (for example, by loading it and then writing the same value back)?

Denis Yaroshevskiy
    Regarding AVX512: https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i – chtz Feb 24 '20 at 07:23

1 Answer

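They do fault suppression and maintain correctness; see "When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?" (https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i).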

A masked store definitely does not do a non-atomic RMW of the masked-out bytes, so it can't step on data another thread is using there.

This all applies to SSE's (slow NT-store) maskmovdqu, AVX's relatively efficient dword/qword masked vmaskmovps/pd and vpmaskmovd/q, as well as AVX512 masked stores.

But it can be slow.

AVX vmaskmov fully-masked stores to read-only pages are very slow, taking a microcode assist for every instruction. (So they perform very badly in a loop over an array doing if(a[i] == x) a[i] = y; when no changes are needed and the page was "clean" and COW mapped to a zero page.)
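One possible way to sidestep that case, offered as an untested-for-performance sketch rather than a measured fix, is to check the compare mask and skip the store when no lane matched. The function names replace_avx2 and replace_all are made up for illustration, the code assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports), and it falls back to a scalar loop elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define REPLACE_HAS_AVX2_PATH 1

// Handles whole 8-lane vectors only; returns how many elements it processed.
__attribute__((target("avx2")))
static std::size_t replace_avx2(std::int32_t* a, std::size_t n,
                                std::int32_t x, std::int32_t y) {
    const __m256i vx = _mm256_set1_epi32(x);
    const __m256i vy = _mm256_set1_epi32(y);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256i v =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        // All-ones in every lane where a[i + lane] == x.
        const __m256i mask = _mm256_cmpeq_epi32(v, vx);
        // testz returns 1 when the mask is all zero: skip the store entirely,
        // so a fully-masked vmaskmov is never issued.
        if (!_mm256_testz_si256(mask, mask)) {
            _mm256_maskstore_epi32(reinterpret_cast<int*>(a + i), mask, vy);
        }
    }
    return i;
}
#endif

// Equivalent to: for each i, if (a[i] == x) a[i] = y;
void replace_all(std::int32_t* a, std::size_t n,
                 std::int32_t x, std::int32_t y) {
    std::size_t done = 0;
#if defined(REPLACE_HAS_AVX2_PATH)
    if (__builtin_cpu_supports("avx2")) {
        done = replace_avx2(a, n, x, y);
    }
#endif
    // Scalar loop for the leftover elements (or everything without AVX2).
    for (std::size_t i = done; i < n; ++i) {
        if (a[i] == x) a[i] = y;
    }
}
```

The branch costs something when matches are frequent, so whether this wins depends on the data.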

I'm not sure how it performs when the full vector splits across two cache lines in the same page, and one of them would miss in cache, but all the elements of that not-present line are masked out. You'd hope that that side of the store just wouldn't end up in the store buffer at all, so there'd be no reason for the core to RFO it (gain exclusive access to it).

Again, architecturally there's no effect on bytes that were masked out, only possible performance.
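As a rough illustration of that guarantee, here is a sketch that fills a 10-element array sitting right in front of an atomic, using an AVX2 masked store for the last partial vector. AVX2 masked stores only come in dword/qword widths, so the hypothetical struct S32 uses 32-bit elements instead of the question's 16-bit ones. The names S32, fill_avx2 and fill_prefix are invented for this example, and the dispatch assumes GCC or Clang on x86 with a scalar fallback elsewhere.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 32-bit variant of the struct from the question.
struct S32 {
    std::int32_t write_here[10];
    std::atomic<std::int32_t> other_thread_can_use_this;
};

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define FILL_HAS_AVX2_PATH 1

__attribute__((target("avx2")))
static void fill_avx2(std::int32_t* dst, std::size_t n, std::int32_t value) {
    const __m256i v = _mm256_set1_epi32(value);
    std::size_t i = 0;
    // Full 8-lane vectors.
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
    }
    const int rest = static_cast<int>(n - i);  // 0..7 leftover elements
    if (rest != 0) {
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        // Lane k is all-ones (sign bit set) exactly when k < rest.
        const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(rest), lane);
        // Only lanes with the sign bit set are written; memory behind the
        // other lanes is neither modified nor able to fault.
        _mm256_maskstore_epi32(reinterpret_cast<int*>(dst + i), mask, v);
    }
}
#endif

// Sets dst[0..n) to value, touching nothing past dst[n - 1].
void fill_prefix(std::int32_t* dst, std::size_t n, std::int32_t value) {
#if defined(FILL_HAS_AVX2_PATH)
    if (__builtin_cpu_supports("avx2")) {
        fill_avx2(dst, n, value);
        return;
    }
#endif
    for (std::size_t i = 0; i < n; ++i) dst[i] = value;
}
```

Calling fill_prefix(s.write_here, 10, 7) on an S32 issues one full vector store and one masked store whose masked-out lanes cover other_thread_can_use_this and the bytes after the struct; those bytes keep whatever value they had.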

Peter Cordes
  • Thank you! I was just looking for a way to deal with corner cases/edges in loops - I think for that masked stores should be great. – Denis Yaroshevskiy Feb 24 '20 at 20:06
  • @DenisYaroshevskiy: yes, the AVX and AVX512 masked stores are ok for that. But not the SSE bytemask version: it's an NT store that bypasses and evicts cache. See also [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](//stackoverflow.com/q/34306933) - often you can do better, e.g. doing one final unaligned vector that partially overlaps with the previous if the size isn't a multiple of the vector width. – Peter Cordes Feb 24 '20 at 20:09
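The overlapping-final-vector idea from the last comment could look roughly like this. It is a minimal sketch using plain SSE2 (4 dword lanes) and a made-up function add_constant; it assumes the input and output buffers don't alias, which is what makes redoing a few elements harmless. Without SSE2 at compile time it just runs the scalar loop.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// out[i] = in[i] + delta for i in [0, n). in and out must not overlap.
void add_constant(const std::int32_t* in, std::int32_t* out,
                  std::size_t n, std::int32_t delta) {
#if defined(__SSE2__)
    constexpr std::size_t kLanes = 4;
    if (n >= kLanes) {
        const __m128i d = _mm_set1_epi32(delta);
        std::size_t i = 0;
        for (; i + kLanes <= n; i += kLanes) {
            const __m128i v =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                             _mm_add_epi32(v, d));
        }
        if (i < n) {
            // One last unaligned vector that ends exactly at n. It recomputes
            // up to 3 elements the loop already wrote, with the same result.
            const std::size_t last = n - kLanes;
            const __m128i v =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + last));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + last),
                             _mm_add_epi32(v, d));
        }
        return;
    }
#endif
    // Fewer elements than one vector (or no SSE2): plain scalar loop.
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] + delta;
}
```

No mask is needed here, but the trick only works when the operation can safely be applied twice to the overlapped elements, and arrays shorter than one vector still need a separate path.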