Let's say we have a processor with two cores (C0 and C1) and a cache line starting at address k that is initially owned by C0. If C1 issues a store instruction to an 8-byte slot in line k, will that affect the throughput of the instructions that follow the store on C1?
The Intel optimization manual has the following paragraph:

When an instruction writes data to a memory location [...], the processor ensures that it has the line containing this memory location in its L1d cache [...]. If the cache line is not there, it fetches from the next levels using an RFO request [...] The RFO and storing the data happen after instruction retirement. Therefore, the store latency usually does not affect the store instruction itself.
With reference to the following code,
// core c1
foo();
line(k)->at(i)->store(kConstant, std::memory_order_release);
bar();
baz();
The quote from the Intel manual leads me to assume that in the code above, execution will proceed as if the store were essentially a no-op, and the store would not add latency between the end of foo() and the start of bar(). In contrast, for the following code,
// core c1
foo();
bar(line(k)->at(i)->load(std::memory_order_acquire));
baz();
The latency between the end of foo() and the start of bar() would be impacted by the load, since bar() takes the result of the load as an argument and therefore has a data dependency on it.
This question is mostly concerned with how Intel processors (Broadwell family or newer) behave in the cases above, and in particular with how C++ code like the above gets compiled down to assembly for those processors.