5

I have an application with 2 threads: thread A is pinned (affinity) to core 1 and thread B is pinned to core 2, and core 1 and core 2 are in the same x86 socket.

Thread A does a busy spin reading an integer x, and thread B increments x under some conditions. When thread B decides to increment x, it invalidates the cache line where x is located, and according to the x86 MESI protocol it puts the new x into its store buffer before core 2 receives the invalidate ack; then, after core 2 receives the invalidate ack, core 2 flushes its store buffer.

I am wondering: does core 2 flush its store buffer immediately after it receives the invalidate ack? And is there any way I can force the CPU to flush the store buffer from C? Because in my case thread A, spinning on x on core 1, should see the new value of x as early as possible.

Peter Cordes
barfatchen
  • So.. ThreadA wastes time on CPUA for no reason while other unrelated processes pound the daylights out of CPUB, then ThreadB finally gets CPUB and increments the variable 1000 times while other unrelated processes pound the daylights out of CPUA; then ThreadA finally gets CPUA again and.. what? Ignores 999 of the increments, or spends ages without checking if the variable is incremented because it's trying to catch up the 1000 it missed? – Brendan Jan 07 '19 at 11:02

2 Answers

10

A core always tries to commit its store buffer to L1d cache (and thus become globally visible) as fast as possible, to make room for more stores.

You can use a barrier (like `atomic_thread_fence(memory_order_seq_cst)`) to make a thread wait for its stores to become globally visible before doing any more loads or stores, but that works by blocking this core, not by speeding up flushing the store buffer.
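For reference, a minimal sketch of what such a barrier looks like in C11 (`x` here is just a placeholder atomic):

```c
#include <stdatomic.h>

_Atomic int x;

void writer(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    // Full barrier: this thread waits here until its store to x has drained
    // from the store buffer and is globally visible, before it's allowed to
    // perform any later loads or stores.
    atomic_thread_fence(memory_order_seq_cst);
    /* ... later loads/stores ... */
}
```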

Obviously to avoid undefined behaviour in C11, the variable has to be `_Atomic`. If there's only one writer, you might use `tmp = atomic_load_explicit(&x, memory_order_relaxed)` and an `atomic_store_explicit` of `tmp+1` to avoid a more expensive seq_cst store or atomic RMW. acq/rel ordering would work too; just avoid the default seq_cst, and avoid an `atomic_fetch_add` RMW if there's only one writer.

You don't need the whole RMW operation to be atomic if only one thread ever modifies it, and other threads access it read-only.
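A minimal sketch of that single-writer pattern (the function names are just placeholders):

```c
#include <stdatomic.h>

_Atomic int x;   // shared counter; only thread B ever writes it

// Thread B: separate load + store instead of an atomic RMW, which is fine
// because no other thread modifies x.
void producer_step(void)
{
    int tmp = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&x, tmp + 1, memory_order_release);   // or relaxed
}

// Thread A: busy-spin until x changes.
int consumer_wait(int last_seen)
{
    int cur;
    do {
        cur = atomic_load_explicit(&x, memory_order_acquire);   // or relaxed
    } while (cur == last_seen);
    return cur;
}
```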


Before another core can read data you wrote, it has to make its way from Modified state in the L1d of the core that wrote it out to L3 cache, and from there to the L1d of the reader core.

You might be able to speed this part along, which happens after the data leaves the store buffer. But there's not much you can usefully do. You don't want to clflush/clflushopt, which would write-back + evict the cache line entirely so the other core would have to get it from DRAM, if it didn't try to read it at some point along the way (if that's even possible).

Ice Lake has clwb which (hopefully) leaves the data cached as well as forcing write-back to DRAM. But again that forces data to actually go all the way to DRAM, not just a shared outer cache, so it costs DRAM bandwidth and is presumably slower than we'd like. (Skylake-Xeon has it, too, but handles it the same as clflushopt. I expect & hope that Ice Lake client/server has/will have a proper implementation.)
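If you did want to experiment with forced write-back anyway, it would look something like this (hedged sketch: requires a CPU with CLWB, a CPUID check in real code, and compiling with something like `-mclwb`):

```c
#include <immintrin.h>

// Force write-back of the cache line holding *p.  Not recommended for the
// use case in the question: it pushes the data all the way to DRAM.
static inline void writeback_line(const void *p)
{
    _mm_clwb(p);              // write back; (hopefully) keep the line cached
    // _mm_clflushopt(p);     // alternative: write back *and* evict the line
    _mm_sfence();             // clwb/clflushopt are weakly ordered vs. stores
}
```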


Tremont (successor to Goldmont Plus, atom/silvermont series) has _mm_cldemote (cldemote). That's like the opposite of a SW prefetch; it's an optional performance hint to write the cache line out to L3, but doesn't force it to go to DRAM or anything.
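A sketch of how that would be used (the `_mm_cldemote` intrinsic name is the one mentioned above; building it needs a compiler that knows the instruction, e.g. gcc/clang with `-mcldemote`):

```c
#include <immintrin.h>

// Hint the CPU to move the line holding *p out toward shared (L3) cache.
// CLDEMOTE is only a hint and is documented to execute as a NOP on CPUs
// that don't support it, so the worst case is simply "no effect".
static inline void demote_line(const void *p)
{
    _mm_cldemote(p);
}
```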


Without special instructions, maybe you can write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction. That would cost extra time in the writing thread, but could make the data available sooner to other threads that want to read it. I haven't tried this.

And this would probably evict other lines, too, costing more L3 traffic = system wide shared resources, not just costing time in the producer thread. You'd only ever consider this for latency, not throughput, unless the other lines were ones you wanted to write and evict anyway.
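For what it's worth, here is a very rough, untested sketch of that conflict-eviction idea, with assumed cache geometry (32 KiB 8-way L1d and 256 KiB 4-way L2 with 64-byte lines, as on many Intel client chips; the 64 KiB stride, the `N_EVICT` count, and the function itself are all made up for illustration and ignore any L2 set hashing):

```c
#include <stdint.h>

#define LINE     64
#define STRIDE   (64 * 1024)   /* aliases the same set in both L1d and L2 (assumed geometry) */
#define N_EVICT  16            /* more than the associativity of either level */

static char scratch[(N_EVICT + 2) * STRIDE];

// Dirty N_EVICT lines that map to the same set as `target`, hopefully forcing
// the dirty target line out of L1d and L2 toward L3.  Costs time in the
// writing thread and evicts innocent lines too.
static void force_evict(const void *target)
{
    uintptr_t base = ((uintptr_t)scratch + STRIDE - 1) & ~(uintptr_t)(STRIDE - 1);
    uintptr_t off  = (uintptr_t)target & (STRIDE - 1) & ~(uintptr_t)(LINE - 1);
    for (int i = 0; i < N_EVICT; i++)
        *(volatile char *)(base + (uintptr_t)i * STRIDE + off) = 1;
}
```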

Peter Cordes
  • For MESI, when some other CPU wants it it will request it and whoever has it will provide it, causing it to be bumped from "modified" to "shared". There's no reason to flush it, and something like `clflush` will cause the other CPU to fetch it all the way from slow RAM chips(!). Note that it's possible that both cores share L2 cache (e.g. Penryn) so for some CPUs it needn't reach LLC. – Brendan Jan 07 '19 at 11:14
  • That's exactly what I said about `clflush`. (except that it can't become "shared" everywhere until something actually writes it back to DRAM. I think it ends up in Modified in L3 after write-back from the first core in response to the read request, but then the reading core can also have a copy of it.) And BTW, in Core2 Merom / Penryn the L2 *is* the last-level cache. It doesn't have an L3. Unless there are models not listed in https://en.wikipedia.org/wiki/Penryn_(microprocessor) that glue on an L3 cache? – Peter Cordes Jan 07 '19 at 11:27
  • The "Xeon" variations of Penryn (but not the mobile & desktops) have 8 to 16 MiB of L3 added. – Brendan Jan 07 '19 at 11:40
  • @Brendan: oh right, I was looking at the "microprocessor" wiki page, not the Core uarch, https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)#Penryn/Wolfdale_(45_nm). https://en.wikipedia.org/wiki/Xeon#Harpertown even has two Penryn dies in one package (so for some pairs of cores, there's *no* level of cache they share, leading to bad inter-core latency I think I read.) But yeah, Dunnington added L3 cache. Anyway, I was basing my answer on Intel's modern i7 style cache hierarchy for simplicity. Triggering write-back from L1d to shared cache with `cldemote` would be nifty. – Peter Cordes Jan 07 '19 at 11:50
  • _maybe you can write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction._ Interesting idea. But it means some effort to implement. Even if I use `GetLogicalProcessorInformationEx` + `CACHE_RELATIONSHIP::CacheSize` + `CACHE_RELATIONSHIP::Associativity` on Windows instead of raw `_cpuid`. Am I understanding correctly that it is only useful if L3 cache is **not inclusive**, and on Skylake it is **not inclusive**, so it may make sense? – Alex Guteniev Jun 09 '20 at 09:45
  • @AlexGuteniev: This doesn't depend on L3 inclusivity. The lines you want to force write-back of are in Modified state in the private L1d cache of one core. Inclusive L3 doesn't mean the data stays in sync (that would mean write-through L1d and L2), just that it has to have a line with a tag that matches the line held by inner caches, so it can track it. Keep in mind that MESI is defined in terms of siblings, not parent/child, but inclusive / exclusive / NINE is usually discussed for a single core hierarchy; it can get confusing to keep track of / reason about. – Peter Cordes Jun 09 '20 at 09:54
  • @AlexGuteniev: [Can an inner level of cache be write back inside an inclusive outer-level cache?](https://stackoverflow.com/q/59450056). I think you were picturing "value inclusion", which would be totally impractical for performance. – Peter Cordes Jun 09 '20 at 09:55
  • @PeterCordes, I was thinking about _write to 8 other locations_, and decided against even trying it. One thing I'm afraid of, except the complexity of universal implementation, is the part that **all** aliased cache lines are evicted to L3. Say, for deep enough queue this will hit elements locations, that may already be prefetched to L1d of other core, or still prefetched by L1d of this core, all will unnecessary go back to L3. – Alex Guteniev Jun 30 '20 at 05:58
  • @AlexGuteniev: Good point, that's a huge downside of this strategy. You'd only consider if for latency reasons, not throughput, but yes it costs extra shared resources not just on the producer core. – Peter Cordes Jun 30 '20 at 06:26
2

You need to use atomics.

You can use atomic_thread_fence if you really want to (the question is a bit XY problem-ish), but it would probably be better to make x atomic and use atomic_store and atomic_load, or maybe something like atomic_compare_exchange_weak.
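A minimal sketch of that approach (using `atomic_fetch_add` for the increment, per the comment below; the names are placeholders):

```c
#include <stdatomic.h>

_Atomic int x;   // the shared counter from the question

// Thread B: increment x atomically.
void increment(void)
{
    atomic_fetch_add(&x, 1);   // seq_cst; the _explicit variant allows weaker ordering
}

// Thread A: busy-spin until x changes.
int wait_for_change(int last)
{
    int cur;
    while ((cur = atomic_load(&x)) == last)
        ;   /* spin */
    return cur;
}
```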

Peter Cordes
nemequ
  • If you were going to use an atomic RMW, you'd use the `++` operator, or `atomic_fetch_add`. Not CAS, because in this case the HW already has atomic increment as an instruction. – Peter Cordes Jan 07 '19 at 10:22