
Edit: I guess the question I asked was too long, so I'm making it very specific.

Question: Suppose a memory location is in the L1 cache and is not marked dirty, and that it holds a value X. What happens if you try to write X to the same location? Is there any CPU that would notice that such a write is redundant and skip it?

For example, is there an optimization which compares the two values and discards a redundant write-back to main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?

Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, size - n of the locations are already 0. To me it seems (intuitively) that avoiding so many redundant write-backs would make the recycling process faster, hence the question.

Doing this in code wouldn't make sense, since the check itself (if (buf[i]) { ... }) has to read the value, which might cause an unnecessary cache miss.
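To make the trade-off concrete, here is a minimal sketch of the two variants (the function names and buffer layout are made up for illustration; the point is that the checked version still has to load each element, so it touches the cache line either way):

```cpp
#include <cstddef>

// Recycling by unconditionally zeroing every location.
void recycle_blind(int* buf, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i)
        buf[i] = 0;            // store happens even if buf[i] is already 0
}

// Recycling that skips redundant stores -- the check itself loads buf[i],
// so the cache line is pulled in (and a branch is executed) either way.
void recycle_checked(int* buf, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i)
        if (buf[i] != 0)       // load + compare + branch per element
            buf[i] = 0;
}
```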

hawk
  • That's a pretty big question that is highly architecture-dependent. Here is an Intel app note about techniques which may be related to your task, or will at least help you frame your question better: http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/ – msw Jul 23 '10 at 05:55
  • I don't think that article answers my question. I've read numerous articles on caching but there's no mention of identical values anywhere. Specific info is much appreciated. – hawk Jul 23 '10 at 15:14
  • I usually advocate that many performance questions *can* be answered by thinking - but this time I'd love to see results of some measurements. (Nudge nudge) – peterchen Jul 28 '10 at 21:43
  • http://www.spiral.net/software/barrier.html – rwong Aug 21 '10 at 23:03
  • Although L1 caches are intimately related to CPUs, this is a memory-system/cache question, not a CPU question. See my response under rwong's answer: you either save nothing or lose a clock cycle, as well as adding unnecessary logic. See http://github.com/dwelch67/amber_samples, or go right to OpenCores and get the Amber project. The amber25 version has a cache; it's open source, so modify it at will and see the differences. – old_timer May 08 '12 at 04:28
  • The best you could do within one clock cycle is: combinationally compute an address into the data RAM in the cache, combinationally compare the output to the value being written (with a lot of logic to handle the byte-lane masking as well), and combinationally determine whether a write cycle is needed; if the cache line is byte-addressable, present the mask, data, and write strobe to the cache if the write is needed, otherwise don't assert the write strobe. That's a lot to do in a cycle. – old_timer May 08 '12 at 04:32
  • You would probably want to use two clock cycles instead: start the ball rolling on one and finish combinationally on the other, or read on one cycle and write on the other. – old_timer May 08 '12 at 04:34
  • By asking this without a specific CPU in mind, you should realize that by changing CPUs your performance changes dramatically. This optimization is so far down in the noise that there is a long list of optimizations far better suited: use a better compiler, program in a lower-level programming language, don't run on an operating system (certainly not Linux or Windows). Just abandoning gcc and using a better compiler can often cover the performance difference. – old_timer May 08 '12 at 04:37
  • Reading your struck-out question: understand how cache lines are evicted. There are different approaches (round-robin, etc.). In your case you want one that avoids discarding a fresh, recently updated line. If one cache line is being hit constantly by the threads, why would it get evicted towards DRAM? It could happily live in the cache forever and never need to go to DRAM until the program finishes running... If you have a cache where you can mark a cache line to not get evicted, that is ideal. Adding this extra read/compare logic just makes no sense. – old_timer May 08 '12 at 04:40
  • Also, if this is for example an x86, or anywhere unaligned transfers are allowed (very, very bad programming and design), it can take even more clock cycles, as you have to read multiple locations in the data cache RAM, then do your compare, and only then are you free to do the writes as needed. Caches and most things like this are built on blocks of RAM with a fixed width; if your write doesn't match that width perfectly, you are adding cycles with the extra reads. The fastest thing you can do is blindly write and set the dirty bit. – old_timer May 08 '12 at 04:47

3 Answers


I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.

I wrote a great big reply, and then I remembered: this is called "Silent Stores" in the literature. See "Silent Stores for Free", K. Lepak and M. Lipasti, UWisc, MICRO-33, 2000.

Anyway, in my reply I described some of the implementation issues.

By the way, topics like this are often discussed in the Usenet newsgroup comp.arch.

I also write about them on my wiki, http://comp-arch.net

Krazy Glew
  • Hey, here's a fun aside: most modern processors have writeback caches, but a few have writethrough caches, possibly in combination with writeback. E.g. the AMD Bulldozer L1 cache is WT, as are several IBM z-series, and many GPUs. Most such WT caches are accompanied by write combining or write coalescing. Such a write coalescing cache or buffer naturally eliminates "silent stores" - they get done to the L1$, but not written through to the L2$. – Krazy Glew May 08 '12 at 04:17
  • Skylake-client and Ice Lake had silent-store optimization for write-back of all-zero lines to L3, but it seems it was disabled by a microcode update in 2021, perhaps because of paranoia over timing attacks or something. We can't have nice things. [What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?](https://stackoverflow.com/q/47417481) / https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html / https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html – Peter Cordes Feb 12 '23 at 20:53

Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:

  1. The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
  2. The old and new values are compared.
  3. If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.

Step 1 may actually take longer than steps 2 and 3, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if this were implemented in software.

Now consider simply writing the new values to the cache without checking the old value. This is actually faster than the three-step process above, for two reasons. Firstly, there is no need to wait for the old value. Secondly, the CPU can simply schedule the write operation in an output buffer. The output buffer can perform the cache write simultaneously while the ALU starts working on something else.
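As a rough software analogue of the same point (hypothetical helper names, purely for illustration), the checked store puts the load on the critical path, while the blind store can be handed off immediately:

```cpp
// Three-step version: the store cannot issue until the load completes.
void store_checked(int* location, int new_value) {
    int old_value = *location;       // step 1: load (latency on the critical path)
    if (old_value != new_value)      // step 2: compare
        *location = new_value;       // step 3: conditional store
}

// Blind version: the store can be queued immediately and completed
// while the ALU moves on to other work.
void store_blind(int* location, int new_value) {
    *location = new_value;
}
```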

So far, the only latencies involved are those between the CPU and the cache, not between the cache and main memory.


The situation is more complicated in modern microprocessors, because their caches are organized into cache lines. When a byte value is written to a cache line, the complete cache line has to be loaded first, because the parts of the line that are not rewritten have to keep their old values.

http://blogs.amd.com/developer/tag/sse4a/

Read
  • Cache hit: Data is read from the cache line to the target register
  • Cache miss: Data is moved from memory to the cache, and read into the target register
Write
  • Cache hit: Data is moved from the register to the cache line
  • Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
rwong
  • You do realize that on a multi-proc machine there needs to be a step 4 - evicting the value all the way to main memory. That step would take anywhere from 10-200 cycles. I imagine it might be worth it to save that extra step sometimes. – hawk Jul 27 '10 at 15:22
  • @hawk: You already know more than I do. Please read http://www.drdobbs.com/architecture-and-design/208200273. Maybe you can consider asking Herb Sutter directly if he is aware of such optimizations. – rwong Jul 28 '10 at 04:49
  • Evicting out to DRAM is a lot more than 200 cycles; it takes hundreds to thousands of cycles to get the value actually into DRAM. There are likely layers of caches as well: when you evict from L1, you hit L2 (if present), which has to go through all the same gyrations (do I have this thing, do I need to fetch a cache line, etc.). – old_timer May 08 '12 at 04:18
  • rwong has pretty much hit the nail on the head on this one: you don't save anything, and you create unnecessary logic, which burns more power, increases the risk of bugs in the implementation, etc. You actually use a clock cycle by doing the compare; blindly writing once you have a hit is faster, dirty or clean. Marking the dirty bit without bothering to compare what it used to be is also faster: just write, just mark the line dirty, done. That's why caches are there: not to make things slower, but faster. – old_timer May 08 '12 at 04:24

This is not an answer to your original question on computer architecture, but it might be relevant to your goal.

In this discussion, all array indices start at zero.


Assuming n is much smaller than size, change your algorithm so that it saves two pieces of information:

  1. An array of length size (the full-size array)
  2. An array of length n, plus a counter, used to emulate a set container. Duplicate values are allowed.

Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.

When the full-size array needs to be cleared, get each value stored in the set container (which will contain k, among others), and set each corresponding index in the full-size array to zero.
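A minimal sketch of this bookkeeping (the type and member names are made up for illustration, and it assumes single-threaded access; the real buffer would need whatever synchronization the threads already use):

```cpp
#include <cstddef>
#include <vector>

// Full-size array plus an emulated "set" of indices that may be non-zero.
// Duplicate indices are allowed, exactly as described above.
struct RecyclableBuffer {
    std::vector<int>         data;     // the full-size array, initially all zeros
    std::vector<std::size_t> touched;  // indices written with non-zero values

    explicit RecyclableBuffer(std::size_t size) : data(size, 0) {}

    void write(std::size_t k, int value) {
        data[k] = value;
        if (value != 0)
            touched.push_back(k);      // remember which locations may need clearing
    }

    // Recycle: clear only the locations that were actually written.
    void recycle() {
        for (std::size_t k : touched)
            data[k] = 0;
        touched.clear();
    }
};
```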


A similar technique, known as a two-level histogram or radix histogram, can also be used.

Two pieces of information are stored:

  1. An array of length size (the full-size array)
  2. A boolean array of length ceil(size / M), where M is the radix and ceil is the ceiling function.

Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.

Let's say bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.

When the full-size array needs to be cleared, scan the boolean array for marked elements and clear the corresponding range in the full-size array.
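A sketch of the two-level variant, again with made-up names and with M chosen arbitrarily as 64:

```cpp
#include <cstddef>
#include <vector>

// Full-size array plus a coarse "dirty block" map: block_dirty[j] covers
// indices j*M .. (j+1)*M - 1 of data.
struct RadixRecyclableBuffer {
    static constexpr std::size_t M = 64;  // the radix (block size), chosen arbitrarily

    std::vector<int>  data;
    std::vector<bool> block_dirty;        // ceil(size / M) entries

    explicit RadixRecyclableBuffer(std::size_t size)
        : data(size, 0), block_dirty((size + M - 1) / M, false) {}

    void write(std::size_t k, int value) {
        data[k] = value;
        if (value != 0)
            block_dirty[k / M] = true;    // mark the block containing index k
    }

    // Recycle: clear only the blocks that were marked.
    void recycle() {
        for (std::size_t j = 0; j < block_dirty.size(); ++j) {
            if (!block_dirty[j]) continue;
            std::size_t end = (j + 1) * M;
            if (end > data.size()) end = data.size();
            for (std::size_t i = j * M; i < end; ++i)
                data[i] = 0;
            block_dirty[j] = false;
        }
    }
};
```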

rwong
  • I'm not looking for a technique in software - in software I can't know when something is in the cache or not. – hawk Jul 26 '10 at 12:22
  • @hawk: My technique reduces the total number of writes needed to implement the same algorithm. So I thought it may be useful to you in a way that does not depend on a specific computer architecture. – rwong Jul 27 '10 at 03:32