
I have the following scenario and am looking for suggestions, please:

I need to share data between two threads, A and B, each running on a different core of the same processor, where thread A writes to an instance of data structure S and thread B reads it. I need the sharing of S to be as consistent and as fast as possible.

struct alignas(64) S
{
    char cacheline[64];
};

I'm planning to leverage the consistency of a cache line, which becomes visible to other cores as an atomic update. The idea is to have thread A write to S as fast as possible (see note 1) so the update is consistent (atomic from a visibility perspective), and then demote the cache line (CLDEMOTE instruction) to the shared cache so that thread B can read it as fast as possible.

Note 1: The reason it needs to happen fast is that once the core running thread A starts writing to the cache line, it should update all of its contents completely before the core makes the line visible in L1 (the updates accumulate in the core's store buffer). Otherwise, if the update takes too long, a "mid-state" of the cache line may be pushed to L1, incurring unnecessary invalidation (MESI) penalties (the writer has to regain ownership to finish), and, worse, an inconsistent state seen by thread B.

Are there better ways to achieve this?

Thanks!

AdvSphere

1 Answer


Yes, store then cldemote is a good plan. It runs as a nop on CPUs that don't support it, so you can use it optimistically. (Test that it actually helps your program on CPUs where it's not a nop, though, in case you accidentally demote before reading the line some more.)
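For reference, a minimal sketch of that store-then-demote pattern with intrinsics (assuming GCC/Clang with -mcldemote; the publish helper name is made up for illustration, and this by itself does not make the 64-byte update atomic, as discussed below):

#include <immintrin.h>   // _mm_cldemote; compile with -mcldemote on GCC/Clang
#include <cstring>

struct alignas(64) S {
    char cacheline[64];
};

// Thread A: fill the whole line with plain stores, then hint the hardware to
// push the (now dirty) line out toward the shared cache for thread B.
inline void publish(S* shared, const char (&payload)[64]) {
    std::memcpy(shared->cacheline, payload, sizeof shared->cacheline);
    _mm_cldemote(shared);   // just a hint; runs as a nop on CPUs without CLDEMOTE
    // Note: nothing here makes the 64-byte update atomic; see the rest of the answer.
}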


Do you actually need atomicity, or is that just nice to have some of the time? If you need atomicity, you can't use separate store instructions. Coalescing in the store buffer isn't guaranteed; for L1d hits it may only sometimes happen on Ice Lake. And an interrupt can happen at any point (unless interrupts are disabled, but SMI and NMI can't be disabled). Including between two stores you were hoping would commit together.

32-byte AVX aligned loads and stores aren't guaranteed atomic, but in practice they probably are on Haswell and later (where the load/store units are 32 bytes wide).

Similarly, 64-byte AVX-512 loads and stores aren't guaranteed atomic, and very likely won't be in practice on Zen4 where they're done in two 32-byte halves. But they probably are on Intel CPUs with AVX-512, if you want to do some testing and find some "works in practice" functionality that doesn't show any tearing on the actual machine you care about.

16-byte loads/stores are guaranteed atomic on Intel CPUs that have the AVX feature flag. (Finally documented after being true for years, fortunately retroactive with an existing feature bit.) AMD doesn't document this yet, but it's probably true of AMD CPUs with AVX, too.

Related: https://rigtorp.se/isatomic/ / SSE instructions: which CPUs can do atomic 16B memory operations?
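If 16 bytes of payload is enough, a sketch of relying on that guarantee with aligned SSE load/store intrinsics (names are illustrative; note these are plain accesses outside the C++ atomics model, so compiler reordering around them still needs separate care):

#include <immintrin.h>

// A 16-byte shared slot.  An aligned 16-byte store/load pair is guaranteed
// atomic on Intel CPUs that report AVX; likely but not yet documented on AMD.
alignas(16) static __m128i shared16;

inline void write16(__m128i v) {
    _mm_store_si128(&shared16, v);     // single aligned 16-byte store
}

inline __m128i read16() {
    return _mm_load_si128(&shared16);  // sees the old or new value, never a torn mix
}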

movdir64b will provide guaranteed 64-byte write atomicity, but only with NT semantics: evicting the cache line all the way to DRAM. It also doesn't provide 64-byte atomic read, so the read side would need to check sequence numbers or something, like a SeqLock.
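A sketch of how that might look (assuming a CPU with the MOVDIR64B feature and GCC/Clang with -mmovdir64b; the helper name is made up):

#include <immintrin.h>   // _movdir64b; compile with -mmovdir64b

struct alignas(64) S {
    char cacheline[64];
};

// Thread A: one architecturally atomic 64-byte store, but with NT / direct-store
// semantics, so the line ends up out in memory rather than hot in a nearby cache.
inline void publish_movdir64b(S* dst, const S* src) {
    _movdir64b(dst, src);   // dst must be 64-byte aligned; src is read non-atomically
}
// There is no matching 64-byte atomic load, so thread B would still pair this
// with a sequence-number check, as described above.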


Intel TSX (transactional memory) can let you commit changes to a whole cache line (or more) as a single atomic transaction. But Intel keeps disabling it with microcode updates. The HLE part (hardware lock elision, which optimistically elided locked instructions like lock add) is fully gone, but the RTM part (xbegin / xend) can still be enabled on some CPUs, I think.
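If RTM does happen to be enabled on your CPU, a hedged sketch with the RTM intrinsics (compile with -mrtm; the fallback policy on abort is up to the caller):

#include <immintrin.h>   // _xbegin / _xend; compile with -mrtm
#include <cstring>

struct alignas(64) S {
    char cacheline[64];
};

// Try to update the whole line as one atomic transaction.  Returns false if the
// transaction aborted; the caller needs a fallback (retry, or e.g. a seqlock).
inline bool try_publish_rtm(S* shared, const S* src) {
    if (_xbegin() == _XBEGIN_STARTED) {
        std::memcpy(shared->cacheline, src->cacheline, sizeof shared->cacheline);
        _xend();             // commit: all 64 bytes become visible at once
        return true;
    }
    return false;
}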


For a use case like this where one thread is only writing, you might consider a SeqLock, using 4 bytes of the cache line as a sequence number. See Optimal way to pass a few variables between 2 threads pinning different CPUs and how to implement a seqlock lock using c++11 atomic library.

The writer can load the sequence number, store seq+1 (with a plain mov store, no lock inc needed), store the payload with regular stores, or SIMD if convenient, then store seq+2.

Unfortunately, without guarantees of vector load/store atomicity, or of ordering between parts of it, you can't have the reader just load the whole cache line at once; you need 3 separate loads (seq number, whole line, then seq number again).
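A minimal sketch of that writer/reader pattern (layout and helper names are illustrative; the plain memcpy of data that may be concurrently written is formally a data race in ISO C++, and the linked Q&As discuss how to make it well-defined, e.g. with relaxed atomic chunks):

#include <atomic>
#include <cstdint>
#include <cstring>

// One cache line: 4-byte sequence number + 60 bytes of payload.
struct alignas(64) SeqLockedS {
    std::atomic<uint32_t> seq{0};   // even = stable, odd = write in progress
    char payload[60];
};

// Single writer (thread A).
void seq_write(SeqLockedS& s, const char (&src)[60]) {
    uint32_t q = s.seq.load(std::memory_order_relaxed);
    s.seq.store(q + 1, std::memory_order_relaxed);          // mark write in progress
    std::atomic_thread_fence(std::memory_order_release);    // keep payload stores after seq+1
    std::memcpy(s.payload, src, sizeof s.payload);          // plain mov stores, or SIMD
    s.seq.store(q + 2, std::memory_order_release);          // publish: even again
}

// Reader (thread B): retry until a stable, even sequence number brackets the copy.
void seq_read(const SeqLockedS& s, char (&dst)[60]) {
    uint32_t q0, q1;
    do {
        q0 = s.seq.load(std::memory_order_acquire);
        std::memcpy(dst, s.payload, sizeof dst);
        std::atomic_thread_fence(std::memory_order_acquire); // order payload loads before q1
        q1 = s.seq.load(std::memory_order_relaxed);
    } while ((q0 & 1) || q0 != q1);
}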

But if you want to use 32-byte atomicity which appears to be true in practice on Haswell and Zen2 and later, maybe put a sequence number in each 32-byte half of a cache line, so the reader can use vpcmpeqd / vpmovmskps / test al,1 to check that the first dword element (the sequence number) matches between halves. Or maybe put them somewhere else within the vector to make reassembling the payload cheaper.

This spends space for two sequence numbers to save loads in the reader, but might cost more overhead in shuffling data into / out of vectors. I guess maybe storing with vmovdqu [rdi+28], ymm1 / vmovdqu [rdi], ymm0 could leave you with 60 useful bytes starting at rdi+4, overwriting the 4-byte sequence number at the start of ymm1. Store-forwarding to a 32-byte load from [rdi+4] would stall, but narrower loads that don't span the boundary between the two earlier stores would be fine.
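A hedged sketch of the two-halves idea with AVX2 intrinsics, under the assumption (not a guarantee) that aligned 32-byte stores and loads don't tear on the CPUs you care about. The layout (4-byte sequence number at the start of each half, 28 payload bytes per half) is just one illustrative choice, and the caller is responsible for putting the same, freshly incremented sequence number in dword 0 of both halves:

#include <immintrin.h>   // AVX2; compile with -mavx2

// One cache line split into two 32-byte halves, each starting with a copy of
// the same 4-byte sequence number (28 payload bytes per half).
struct alignas(64) TwoHalves {
    char line[64];
};

// Writer: both halves already carry the same sequence number in their first dword.
inline void write_halves(TwoHalves* s, __m256i lo, __m256i hi) {
    _mm256_store_si256(reinterpret_cast<__m256i*>(s->line), lo);
    _mm256_store_si256(reinterpret_cast<__m256i*>(s->line + 32), hi);
}

// Reader: two 32-byte loads, accepted only if the first dword (sequence number)
// of both halves matches, i.e. both halves came from the same update.
inline bool read_halves(const TwoHalves* s, __m256i* lo, __m256i* hi) {
    *lo = _mm256_load_si256(reinterpret_cast<const __m256i*>(s->line));
    *hi = _mm256_load_si256(reinterpret_cast<const __m256i*>(s->line + 32));
    __m256i eq = _mm256_cmpeq_epi32(*lo, *hi);               // vpcmpeqd
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));  // vmovmskps
    return mask & 1;                                          // did dword 0 (seq) match?
}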


Related Q&As about solving the same problem of pushing data for other cores to be able to read cheaply:

Peter Cordes
  • What an amazing answer, thank you! This is exactly what I was looking for. I should have thought about the AVX-512 registers (I've used them in the past)! I didn't know about the seqlock being used in Linux itself. – AdvSphere Oct 08 '22 at 18:53
  • @AdvSphere: Note that any use of 512-bit registers will limit max turbo clock speed, so a seqlock might be better. Or limit your data structure to half a cache line if you want to rely on 32-byte atomicity in practice on Zen2 / Skylake. Hrm, or put a sequence number in each 32-byte half, so the reader can check with `vpcmpeqd` / `vpmovmskps` / `test al,1` to check that the first dword element (sequence number) matched between halves. – Peter Cordes Oct 08 '22 at 18:57
  • When it becomes available, MOVDIR64B will provide an atomic 64-byte write. – prl Oct 09 '22 at 00:19
  • @prl: Oh right, good point. But with cache-bypassing NT store semantics, making it probably not useful for the inter-thread communication use-case. Also no atomic 64-byte read. I guess using a sequence number the reader could still read separate parts to detect tearing, a bit like the read side of a SeqLock but the writer would only need to increment by 1 as part of the 64-byte store. – Peter Cordes Oct 09 '22 at 01:09