
I'm trying to decide between two algorithms. One writes 8 bytes (two aligned 4-byte words) to 2 cache lines, the other writes 3 entire cache lines.

If the CPU writes only the changed 8 bytes back to memory, then the first algorithm uses much less memory bandwidth: 8 bytes vs 192 bytes. If the CPU writes entire cache lines, then the difference between 128 and 192 bytes is less striking.

So how does an Intel Xeon CPU write back to memory? You'd be surprised how hard it is to find an answer on Google to something that should be well known.

As I understand it, the writes go into the store buffer, and then to the cache. They might only get written to memory when the dirty cache line is evicted from the cache, but does Intel track which parts of the cache line are dirty, or just dump the entire thing? I rather doubt that they track things below cache line granularity. I would also be very surprised if anything goes to memory before the cache line is evicted.
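
To make the comparison concrete, here's a rough sketch of the two access patterns (the names and buffer layout are made up for illustration; I'm assuming 64-byte cache lines):

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Algorithm A: two aligned 4-byte stores landing in two different lines,
 * so only 8 bytes are modified but 2 lines become dirty. */
void write_two_words(uint32_t *word_in_line0, uint32_t *word_in_line1,
                     uint32_t v0, uint32_t v1) {
    *word_in_line0 = v0;
    *word_in_line1 = v1;
}

/* Algorithm B: rewrites three whole cache lines (192 bytes). */
void write_three_lines(uint8_t *dst, const uint8_t *src) {
    memcpy(dst, src, 3 * CACHE_LINE);
}
```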

– Eloff
  • Why the downvotes? This is a great question! – Stephan Dollberg Jul 25 '15 at 21:01
  • @inf One problem with the question is that the term "Intel Xeon CPU" doesn't make a useful distinction here. The Xeon trademark has been applied to Intel x86 CPUs since the Pentium II architecture. Technically it doesn't really denote a different kind of processor so much as a different kind of customer the processors are marketed towards. By limiting the question to just "enterprise-class" CPUs, it's less useful than one that just asked about Intel x86 CPUs generally. The answer is going to be the same either way. – Ross Ridge Jul 26 '15 at 02:34
  • @RossRidge Well, then simply ask for clarification about which architecture he's referring to, and don't go on a downvote spree. – Stephan Dollberg Jul 26 '15 at 06:39
  • @inf Downvoting one question is not a "spree". And that's not how it works here. People are free to vote on questions as they're actually written. They don't have to guess or ask about what the poster really meant. In any case, I don't actually know why anyone voted the way they did. I was just pointing out a problem with the post that could be the reason why. – Ross Ridge Jul 26 '15 at 13:29
  • It seems that your main goal is to decide between two algorithms (based on performance). Is there a good reason not to just benchmark both? It might be a bit more work, but it's guaranteed to give you exactly the information you need to make the choice. – Hugh Allen Jul 26 '15 at 14:08
  • @RossRidge I'm not old enough to remember Xeons based on the Pentium II. I'll restrict my interest to Sandy Bridge and newer CPUs, since in a cloud-services world that's about as old as you'll find. I used Xeon in the title because more people know what a Xeon is than what a Sandy Bridge is. – Eloff Jul 26 '15 at 15:11

2 Answers


Locality matters even for DRAM itself, even discounting caching. A burst write of 64 contiguous bytes for a dirty cache line is a lot faster than 16 writes of 4B each to 16 different addresses. Or to put it another way, writing back an entire cache line is not much slower than writing back just a few changed bytes in a cache line.
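
One way to see this from user space is to time a pass of contiguous 4B stores against a pass that scatters one 4B store per line. This is only a sketch (buffer size and stride are arbitrary, and the timing harness is left out); the buffer must be much larger than the last-level cache so dirty lines actually reach DRAM. Note that `sequential` performs 16× as many stores, yet typically wins on time per line written back:

```c
#include <stdint.h>
#include <stdlib.h>

#define LINE   64u
#define NLINES (1u << 22)                 /* 4M lines = 256 MiB, far bigger than any LLC */
#define BUFSZ  ((size_t)NLINES * LINE)

/* Contiguous 4B stores: every line is fully rewritten, so write-back
 * happens in burst-friendly whole lines. */
void sequential(uint32_t *buf) {
    for (size_t i = 0; i < BUFSZ / 4; i++)
        buf[i] = 1;
}

/* One 4B store per line, hopping around with a large stride so the
 * hardware prefetchers and DRAM row buffers get no help. */
void scattered(uint32_t *buf) {
    const size_t stride = (64u * 1024u + LINE) / 4;   /* in 4-byte words */
    size_t w = 0;
    for (size_t i = 0; i < NLINES; i++) {
        buf[w] = 1;
        w = (w + stride) % (BUFSZ / 4);
    }
}

int main(void) {
    uint32_t *buf = aligned_alloc(LINE, BUFSZ);
    if (!buf) return 1;
    sequential(buf);   /* time each call separately, e.g. clock_gettime or perf stat */
    scattered(buf);
    free(buf);
    return 0;
}
```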

What Every Programmer Should Know About Memory, by Ulrich Drepper, explains a lot of stuff about avoiding memory bottlenecks when programming. He includes some details of DRAM addressing. DRAM controllers have to select a row, and then select a column. Accessing another virtual memory page can also cause a TLB miss.

DRAM does have a burst-transfer command for transferring a sequential chunk of data. (Obviously designed for the benefit of CPUs writing back cache lines). The memory system in modern computers is optimized for the usage-pattern of writing whole cache lines, because that's what almost always happens.

Cache lines are the unit at which CPUs track dirty-or-not. It would be possible to track dirtiness at a finer granularity than the present-or-not cache line, but that would take extra transistors and isn't worth it. The multiple levels of cache are set up to transfer whole cache lines around, so they can be as fast as possible when a whole cache line needs to be read.

There are so-called non-temporal reads/writes (movnti/movntdqa) that bypass the cache. These are for use with data that won't be touched again until it would have been evicted from the cache anyway (hence the non-temporal). They are a bad idea for data that could benefit from caching, but would let you write 4 bytes to memory rather than a whole cache line. Depending on the MTRR for that memory range, the write might or might not be subject to write-combining. (This is relevant for memory-mapped I/O regions, where two adjacent 4B writes aren't the same as one 8B write.)
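
For example, a 4-byte `movnti` via the SSE2 intrinsic (a sketch only; `nt_store_u32` and `nt_publish` are made-up helper names, and how often you fence depends on your ordering requirements):

```c
#include <immintrin.h>   /* _mm_stream_si32, _mm_sfence */
#include <stdint.h>

/* 4-byte non-temporal store: compiles to movnti, bypassing the cache. */
void nt_store_u32(int32_t *p, int32_t v) {
    _mm_stream_si32((int *)p, v);
}

/* NT stores are weakly ordered: issue one sfence after a batch of them
 * (not after every store) before publishing the data to other threads. */
void nt_publish(void) {
    _mm_sfence();
}
```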

The algorithm that only touches two cache lines certainly has the advantage on that score, unless it takes a lot more computation, or especially branching, to figure out which memory to write. Maybe ask a different question if you want help deciding. (see the links at https://stackoverflow.com/tags/x86/info, esp Agner Fog's guides, for info that will help you decide for yourself.)

See Cornstalks' answer for warnings about the dangers of having multiple threads on different CPUs touching the same memory. This can lead to bigger slowdowns than just extra writes for a single-threaded program.

– Peter Cordes
  • I forgot about non-temporal writes. With those I could really make the one algorithm write only 8 bytes, and it's not likely to be read again soon, so it could work. Now I have to give serious consideration to which algorithm to use. – Eloff Jul 26 '15 at 15:25
  • Cache will be a big speedup if it'll be *written* again soon, not just read. But it's worth giving it a try with `movnti` of an `int32_t` or `int64_t`. If the access pattern is sequential (even with a moderate-size stride), HW prefetching will bring data from DRAM into cache so writes will be fast. (`movnti` will defeat this). I think it's pretty rare for `movnt` to be a win, but this might be one of those cases. Ideally you can test both ways in the context of your full program, not a microbench. – Peter Cordes Jul 26 '15 at 16:31
  • @Eloff, non-temporal stores aren't useful in this scenario: if they can't combine to full 64B lines, they still won't write the partial bytes (imagine you have the line modified locally in some other core's cache - who is going to do the merge?). I think only some uncacheable memtypes would allow that, and the penalty is huge (probably snoops the line from all other cores first to avoid loss of coherency); it's not used for performance, only for functional purposes like devices/IO. – Leeor Jul 28 '15 at 05:15
  • @Leeor: so what happens if a NT store buffer is flushed before filling a full cache line? A read-for-ownership like the first normal write to the cache-line would have generated, then merge and write to DRAM? I should update my answer if NT writes are only useful for performance when writing whole cache lines. – Peter Cordes Jul 28 '15 at 05:37
  • I would assume that a read-for-ownership is required to begin with, to avoid issues with how x86 has to maintain coherency (since non temporals may apply to WB mem type, which doesn't get any discounts as weaker types do). Non temporals still have advantages for performance, but not for partial lines - see http://stackoverflow.com/questions/30313600/why-does-my-8m-l3-cache-not-provide-any-benefit-for-arrays-larger-than-1m/30347796?s=1|1.1388#30347796 – Leeor Jul 28 '15 at 05:52

In order for the CPU to write only the dirty bytes back to memory, it would need to store a dirty bit for every byte in the cache. That's infeasible and isn't done on modern CPUs (as far as I know). CPUs only have one dirty bit for a cache line. Writing to any byte in the cache line causes the whole line to be marked as dirty.

When it comes time to flush the dirty cache line, the whole line needs to be written, because the CPU has no idea which byte(s) changed.
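
As a toy model (purely illustrative, not how any real cache is coded), the per-line metadata looks something like this, and write-back has no choice but to copy all 64 bytes:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

struct cache_line {
    uint64_t tag;          /* which 64B block of memory this line holds */
    bool     valid;
    bool     dirty;        /* ONE bit covering all 64 data bytes */
    uint8_t  data[LINE];
};

static void store_byte(struct cache_line *cl, unsigned off, uint8_t b) {
    cl->data[off] = b;
    cl->dirty = true;      /* no record of WHICH byte changed */
}

static void write_back(struct cache_line *cl, uint8_t *memory) {
    if (cl->dirty) {
        /* must copy all 64 bytes: the changed ones are unknown */
        memcpy(memory + cl->tag * LINE, cl->data, LINE);
        cl->dirty = false;
    }
}
```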

This can be seen in cache-coherency behaviour, where a write to a cache line in one core invalidates the copy of that line in a different core (because both copies cache the same address), even if the first core is using the low half of the cache line and the second core is using the high half. That is, if core 1 writes to byte N, and core 2 is using byte N+1, then core 2 still has to refresh its cache line even though you and I know that it's not necessary.
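
You can observe that ping-ponging with a sketch like the following (my illustration; the iteration count is arbitrary and timing code is omitted): two threads increment adjacent counters, first sharing a line, then padded onto separate lines. The padded version is typically several times faster.

```c
#include <pthread.h>   /* build with -pthread */
#include <stdint.h>

#define ITERS 100000000ULL   /* arbitrary */

/* Both counters packed into one 64-byte line: it ping-pongs between cores. */
_Alignas(64) struct { volatile uint64_t a, b; } same_line;

/* Each counter on its own line: no coherence traffic between the threads. */
struct {
    _Alignas(64) volatile uint64_t a;
    _Alignas(64) volatile uint64_t b;
} padded;

static void *bump(void *arg) {
    volatile uint64_t *ctr = arg;
    for (uint64_t i = 0; i < ITERS; i++)
        (*ctr)++;
    return NULL;
}

static void run_pair(volatile uint64_t *x, volatile uint64_t *y) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

int main(void) {
    run_pair(&same_line.a, &same_line.b);  /* slow: false sharing */
    run_pair(&padded.a,    &padded.b);     /* fast: separate lines */
    return 0;
}
```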

– Cornstalks
  • I think Xeons write 64 bits at a time on the memory bus? So one could imagine having 8 dirty bits per cache line to track which 8-byte word is dirty. I just don't think it's actually done. By the way, you just described why false sharing is a thing with your cache-invalidation example. Cores can end up ping-ponging a cache line back and forth even though they don't really need to conflict with each other. – Eloff Jul 25 '15 at 21:23
  • @Eloff: One could imagine having 8 dirty bits for a cache line, but I don't think that makes sense. The whole reason for having a cache line is to have a fundamental unit of spatial locality which the CPU operates on. If the cache line can be further divided up into individual blocks by the CPU, then it's not much of a cache line, is it? – Cornstalks Jul 25 '15 at 21:57
  • agreed, it doesn't make sense to me either – Eloff Jul 25 '15 at 22:16
  • The x86 simulator [PTLsim](http://goo.gl/T63TLL) implements a cache line with a valid mask. The manual says this allows a "no stall on store" policy whereby stores to non-resident cache lines can commit early and fetch the rest of the cache line without stalling the processor. It seems useful enough, although I don't know if any commercial processors do this. – hayesti Jul 27 '15 at 23:13