5

I sometimes see optimized code like this:

if (matrix[t] != 0) {
    matrix[t] = 0;
}

As opposed to just this code:

matrix[t] = 0;

I suppose this code is written this way to reduce memory bandwidth in the CPU. Is this a good optimization on a typical CPU (when the value is likely to be 0 already) and why?

What does this mean for the MESI-state: Is there a state transition from e.g. shared to modified if I write the same value back into a cache line (write but no modification)? Or would this be too complicated for the CPU to detect?

Are typical CPUs (or at least some) optimizing anything about this case?

dinfuehr
  • 587
  • 6
  • 13
  • Reads can be slower than writes (the processor has to wait for the round trip), so I don't see that as an optimization (and it adds more bandwidth by doing the write anyway if the read produced something non-zero). The write wouldn't necessarily increase DRAM/slow-memory bandwidth, slowing everyone down, until evicted from the cache. If the writes passed through/around the cache and didn't cause the space to be cached (write-through), then sure, this is an optimization. – old_timer Oct 17 '17 at 14:26

1 Answer

4

AFAIK, no x86 microarchitectures attempt to commit a store from the store buffer to L1D by reading while it's still in Shared MESI state and checking if the value matches.

A silent store is usually going to be rare, and only worth the extra cache-access cycles for hot shared variables, so it doesn't make sense for a microarchitecture to do this by default. Most stores are not to shared variables, and the store buffer doesn't know which stores are to shared variables anyway.


In cases where that's worth doing (i.e. sometimes for shared variables), you have to do it yourself with code like the if() in the question. That is exactly what that code is for, and yes it can be a win.

It's a good idea to avoid writing to shared variables if there's a good chance some other thread has read them more recently than you last wrote, because a write always Invalidates all other copies to get the line into Modified state in the writing CPU's cache.

In some cases the cost of the load + branch mispredict might be higher than the saving, especially if it doesn't predict well. (A speculative RFO might even invalidate other copies before the mispredict is detected. Of course a speculative store can't actually commit to L1D, but the read for ownership can happen AFAIK.)

As another example, in the retry loop of a spinlock you always want to spin on a pure load (+ `pause`), not on `xchg`. Spinning on `xchg` or `lock cmpxchg` will keep hammering on that cache line and delay the code that's actually unlocking it.


Intel's optimization manual even suggests this optimization in the TSX chapter, to reduce transaction aborts in other threads that are accessing the shared variable, by avoiding unnecessary stores.

// Example 12-1
state = true; // updates every time
var |= flag;

vs.

if (state != true) state = true;
if (!(var & flag)) var |= flag;

With TSX, transaction aborts have even higher costs than just extra waiting for MESI, so the chance of it being worth it is probably higher.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    I wondered if the compiler could legitimately just transform this loop of check-then-set into a `memset` of 0 (i.e., eliminate the check, since that's probably profitable except in the unusual cases of sharing as you mention). It doesn't seem like [they do](https://godbolt.org/g/CBmA2A), however. You certainly can't do it on an arbitrary `int *` of unknown provenance since it could have been defined `const` and so writing (but not reading) would be UB. I think the data race rules also make introducing a write where there was none in the original program very difficult. – BeeOnRope Oct 17 '17 at 19:49