
Let's say on the x86-64 arch there are 2 cores, and each core has a thread doing the following in a loop: compare-and-swap a shared value (test whether the shared value is 0 and, if so, change it to 1), do something else, and then set the value back to 0; quite like a simple spinlock (sketched below). My problem is this: if core-1 has set the value to 1 and core-2 is busy-waiting (testing the value), then when core-1 sets the value back to 0, the CPU may do the following on a timeline:

time 0: core-1 puts the new value in its store buffer and sends a "read invalidate" message to core-2
time 1: core-2 gets the message, saves it to its invalidate queue, and sends an ACK to core-1
time 2: core-1 gets the ACK and flushes its store buffer
time 1.5 or 2.5: core-2 flushes its invalidate queue

So if at time 0.5 core-1 reads the value again, it can get the newer data, but core-2 still has the stale data. This is my guess; will it really happen like this? If "yes", how can the problem be fixed? I don't think a memory barrier or LOCKing the bus would help here. Additionally, does a C++11 std::atomic value have this problem?
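
To make the setup concrete, here is a minimal sketch of the loop each core runs, using C++11 std::atomic (the critical-section work is just a placeholder):

```cpp
#include <atomic>

std::atomic<int> flag{0};   // the shared value: 0 = free, 1 = taken

void worker() {
    for (;;) {
        // Take: atomically test for 0 and change it to 1, spinning until it works.
        int expected = 0;
        while (!flag.compare_exchange_weak(expected, 1,
                                           std::memory_order_acquire)) {
            expected = 0;   // on failure, compare_exchange overwrites 'expected'
        }

        // ... do something else (the critical section) ...

        // Release: set the value back to 0 with a plain atomic store.
        flag.store(0, std::memory_order_release);
    }
}
```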

  • This is not a real problem. Spinlocks work on x86 using `lock cmpxchg` to take the lock and a `mov` store of `0` to release it ([minimal example actually using `xchg` / `test`](https://stackoverflow.com/questions/37241553/locks-around-memory-manipulation-via-inline-assembly/37246263#37246263)), just like on every other modern mainstream ISA. (Although others may need extra barriers to give acquire and release semantics to the RMW and the store.) – Peter Cordes Dec 04 '21 at 15:14
  • I looked at this again; maybe you're asking about the fact that a core can see its own stores via store-forwarding before they become globally visible? Yes, x86's memory model is program order plus a store buffer with store-forwarding. (The store buffer alone allows StoreLoad reordering, and store forwarding creates other interesting effects.) In the rare case that's a problem, a full barrier (like any `lock`ed instruction, or `mfence`) drains the store buffer before any later load or store can happen. That's why std::atomic uses `xchg` (implicit lock) for stores with mo_seq_cst (see the sketch after these comments). – Peter Cordes Dec 05 '21 at 21:13
  • I read more about this question too. According to the Intel dev manual, "the LOCK# signal is generally not asserted. Instead, only the processor’s cache is locked. Here, the processor’s cache coherency mechanism ensures that the operation is carried out atomically with regards to memory." So I guess: the hardware uses MESIF & cache locking to solve data consistency, but that only works if the value is in the cache rather than in a buffer (the store buffer), so there must be some mechanism to flush the store buffer & invalidate queue, or to not use them; but I cannot find anything about it. – Ryan Gao Dec 06 '21 at 03:43
  • Actually, I don't know what "cache locking" does: lock the ring bus? Or detect/resolve conflicts? And/or bypass the buffers? – Ryan Gao Dec 06 '21 at 03:48
  • `mfence` or any `lock`ed instruction drains the store buffer before later loads (or, of course, stores). The invalidate queue, if it's a real thing at all on x86 CPUs, isn't allowed to cause any visible memory-reordering effects in the x86 memory model. – Peter Cordes Dec 06 '21 at 03:48
  • A "cache lock" is just delaying replies to MESI share / RFO / invalidate requests between the start of an atomic RMW (its load) and when it commits a value to the local L1d. It's a purely local (to one core) thing that modern x86 CPUs do to make something like `xchg` or `lock add [mem], eax` atomic wrt. any cache-coherent outside observer. [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) – Peter Cordes Dec 06 '21 at 03:50
  • This makes it much clearer to me, thanks again – Ryan Gao Dec 06 '21 at 03:50
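
Following up on the comments: here is a minimal sketch of the seq_cst-store behaviour described above, assuming an x86 target where the compiler emits `xchg` for a `memory_order_seq_cst` store (as mainstream compilers do); the function name is illustrative:

```cpp
#include <atomic>

std::atomic<int> flag{1};

void release_then_check() {
    // A seq_cst store acts as a full barrier: x86 compilers emit `xchg`
    // (an implicitly `lock`ed RMW), which drains the store buffer before
    // any later load or store can execute.
    flag.store(0, std::memory_order_seq_cst);

    // Equivalent alternative: a release store plus a full fence
    // (`mfence` on x86):
    //   flag.store(0, std::memory_order_release);
    //   std::atomic_thread_fence(std::memory_order_seq_cst);

    // By this point the 0 is globally visible, so this core does not see
    // its own store "early" relative to what other cores can observe.
    int v = flag.load(std::memory_order_seq_cst);
    (void)v;
}
```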

0 Answers