Cache coherency deals with read/write ordering for a single memory location in the presence of caches, while memory consistency is about ordering accesses across all locations with/without caches.

Normally, processors/compilers guarantee weak memory ordering, requiring the programmer to insert appropriate synchronisation events to ensure memory consistency.

My question is: if programmers have to insert these events anyway, is it possible to achieve memory consistency without cache coherency in cache-based processors? What are the trade-offs? As far as I know, GPUs don't have coherent caches, so it should indeed be possible.

My intuition is that synchronisation events would become terribly slow without cache coherency, because whole caches might need to be invalidated/flushed at each sync, instead of specific lines being flushed/invalidated continuously in the background by the coherency machinery. But I could not find any material discussing these trade-offs (and neither did ChatGPT help ;)).

  • I'm not familiar with all GPU architectures but on Nvidia, the only cache that is non-coherent is the level 1 cache. That one is only used for read-only memory through specialized instructions, or if the compiler knows that no other thread can access the data, e.g. register spilling, local variables, etc. It is flushed after every kernel call. – Homer512 Aug 07 '23 at 12:01
  • Well, one could require the programmer to flush the cache (or a cache line) every time a new value should be "published" to a shared variable. This way, the cache could be non-coherent and the programmer would have to explicitly handle the coherency. However, flushing data to memory is slow, and broadcasting it to every other core may consume a lot of bandwidth for nothing if only a single other core would actually access it. – Margaret Bloom Aug 07 '23 at 12:19
  • @MargaretBloom Agreed, but the hope is that such flushes/broadcasts/invalidations would be rare and performed precisely when actually required, instead of the coherency protocol preemptively taking these actions for every cacheline. This is in the same vein of forcing memory ordering when it actually matters for memory consistency, instead of always guaranteeing sequential consistency. – Nimish Shah Aug 07 '23 at 12:50
  • @Homer512 Is there a situation where memory consistency could never be guaranteed without cache coherency, even with special synchronization instructions? – Nimish Shah Aug 07 '23 at 12:52
  • Let's say you have two ints in an array. Thread A modifies the first, thread B modifies the second. Both want to publish their results. If your synchronization is simply to flush the cache line, this will not work: both ints are probably in the same line, and both threads have divergent views of that line. You would need to keep track of which bytes in each cache line were modified and selectively publish only those. And since memory barriers follow regular memory operations, you have to do it for normal memory operations, not just specialized ones. At that point MESI sounds cheaper. – Homer512 Aug 07 '23 at 13:08
  • So, basically, if caches tracked dirty status per byte, it would be possible to get functionally correct execution without cache coherency, but this is prohibitively costly (is it really, especially if it allows getting rid of the coherency machinery?). Instead, coherency allows the cheaper solution of tracking dirty status at cache-line granularity. – Nimish Shah Aug 07 '23 at 13:38
  • But note that coherency is not sufficient to ensure consistency while modifying shared data anyway; synchronisations like locks/barriers are still needed. Two threads cannot modify the same location without a data race even with cache coherency. I am still not able to articulate exactly what coherency brings to the table. The benefit could not just be tracking dirty bits at line vs. byte granularity, right? – Nimish Shah Aug 07 '23 at 13:39
  • Keep in mind that we are also sort of looking at it backwards. Memory fences, and programming models that assume regular memory accesses can be published via fences, exist because they were built with MESI in mind. You could imagine different architectures that require specialized tracking of potentially shared access; maybe closer to transactional memory. Or how Rust prevents shared read-write access. – Homer512 Aug 07 '23 at 13:44
  • Indeed, a lot of legacy programming models might have assumed cache coherence. But then it just sounds like legacy baggage rather than fundamental functionality required for efficient and correct execution (something that comp arch textbooks should definitely emphasize!). – Nimish Shah Aug 07 '23 at 13:55
  • A slightly tangential but related question would be why is cache coherency required but not register file coherency?! Register allocation could easily create the same problems that cache coherency strives to solve. – Nimish Shah Aug 07 '23 at 13:57
  • I guess the issue might also be related to context switches. We really don't want to flush caches on a context switch (but are ok to flush register files)? GPUs don't have an OS and are not supposed to switch context as often as a CPU? – Nimish Shah Aug 07 '23 at 13:59
  • I wouldn't call MESI a legacy model. Nothing I've read in this discussion has convinced me the alternatives would be better. But this also looks like something that might have its own research field that I'm not acquainted with. Re. registers: context switches are already rare and costly, since you usually have to flush the TLB. And they can simply be handled like normal memory operations. – Homer512 Aug 07 '23 at 14:26
  • @Homer512: In a programming model where cache lines only get "pushed" to other cores explicitly, you might program it using a model like MPI, where non-coherent shared memory is used for fast message passing, and the program explicitly indicates which cache lines should be flushed. Context switch (to another thread on this core) isn't a problem, but CPU migration is: on real CPUs, we need at least acq/rel sync between the old core and the new core to make sure a single thread sees *its own* operations after resuming. The OS doesn't know which lines, so without coherency it would flush everything. – Peter Cordes Aug 07 '23 at 17:38
  • @PeterCordes Good point. I guess you may justify the cost if it improves scalability and thereby core-count; which would reduce the need for migrations. The issue I see with retrofitting such an architecture is how you differentiate between shared and private memory access. Is it a specialized instruction? Then you need most functions like `memcpy` for both usages. Do you encode the usage pattern in the pointer (I think CUDA does this), or the page table (sort of like write-combining memory)? – Homer512 Aug 07 '23 at 18:09
  • @Homer512: I think there have been cluster supercomputers with non-coherent shared memory, but those aren't single-system-image machines. You don't boot one SMP Linux kernel across all nodes of a cluster like that, each machine runs its own kernel, so threads don't migrate across coherency domains. The shared memory (or remote-DMA?) is used only for message-passing via MPI, I think. (And cluster nodes do maintain cache coherency between their own cores.) Going further than that, yeah you'd need a new design and a suitable language to program it in, but IDK what that'd look like. – Peter Cordes Aug 07 '23 at 18:35
  • You might find it an interesting exercise to think about how you would implement the C++ memory model on a non-coherent architecture. An acquire barrier would have to invalidate the entire cache, and a release barrier would have to flush it. That's pretty expensive. And you've noticed the issue that C++ requires that concurrent accesses to *different* objects must not cause a data race, even if they happen to occupy the same cache line. So either cache lines are 1 byte, or small accesses require an atomic read-modify-write. Which, by the way, how do you do an atomic RMW? – Nate Eldredge Aug 13 '23 at 03:25
  • It seems that an atomic RMW has to load and store all the way to main memory, holding a bus lock or something while it does so. And before that, it has to somehow shoot down all other copies of the cache line that happen to be present in other cores. – Nate Eldredge Aug 13 '23 at 03:33
  • Agreed Nate, supporting existing memory models would be too expensive without coherency. But existing models are designed while assuming coherency. My question is: would it be possible to design a non-coherency-friendly memory model that is perhaps a bit less intuitive and puts more responsibility on the programmer, without sacrificing too much performance (for at least an interesting suite of workloads)? It seems cluster supercomputers have tried it for niche HPC applications, as per @PeterCordes's comment. – Nimish Shah Aug 14 '23 at 10:56
  • For acquire-release barriers, the hope is that for many applications they are relatively rare. It would require a performance analysis to quantify the exact impact, but this sounds research-worthy and not a slam dunk in favour of coherency. For concurrent accesses to different objects, the model could require the objects to always be cache-line aligned. That bloats the memory footprint, but doesn't sound like a fundamental issue, especially if most objects are sufficiently large. – Nimish Shah Aug 14 '23 at 11:01
  • Kernel code (including in interrupt handlers) often needs to access concurrent data structures, at least the way kernels are currently written. Sometimes maybe without ordering wrt. anything else, like `memory_order_relaxed`, and sometimes read-mostly, so there might be some ways around it, but a lot of stuff would have to be redesigned from the ground up to run a single-system-image OS and threads on a NUMA system that wasn't ccNUMA. – Peter Cordes Aug 14 '23 at 17:12
  • (Ping @NateEldredge since the OP replied without tagging you. Good points about non-conflicting writes to different bytes of the same line, and RMWs. I should mention those problems in the part of [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) where I discuss the fact that all real C++ implementations run on cache-coherent hardware, and that it would only be theoretically possible to run on non-coherent hardware with probably terrible performance.) – Peter Cordes Aug 14 '23 at 17:14

1 Answer

There is a section on this in the book *Parallel Computer Architecture: A Hardware/Software Approach* (perhaps a bit outdated).

Shared Address Space without Coherent Replication

Systems in this category support a shared address space abstraction through the language and compiler but without automatic replication and coherence, just like the CRAY T3D and T3E did in hardware. One type of example is a data parallel language like High Performance Fortran (see Chapter 2). The distributions of data specified by the user, together with the owner computes rule, are used by the compiler or run-time system to translate off-node memory references to explicit messages, to make messages larger, to align data for better spatial locality, and so on. Replication and coherence are usually left up to the user, which compromises ease of programming; alternatively, system software may try to manage coherent replication in main memory automatically. Efforts similar to HPF are being made with languages based on C and C++ as well (Bodin et al. 1993; Larus, Richards, and Viswanathan 1996).

A more flexible language- and compiler-based approach is taken by the Split-C language (Culler et al. 1993). Here, the user explicitly specifies arrays as being local or global (shared) and for global arrays specifies how they should be laid out among physical memories. Computation may be assigned independently of the data layout, and references to global arrays are converted into messages by the compiler or run-time system based on the layout. The decoupling of computation assignment from data distribution makes the language much more flexible than an owner computes rule for load-balancing irregular programs, but it still does not provide automatic support for replication and coherence, which can be difficult for the programmer to manage. Of course, all these software systems can be easily ported to hardware-coherent shared address space machines, in which case the shared address space, replication, and coherence are implicitly provided. In this case, the run-time system may be used to manage replication and coherence in main memory and to transfer data in larger chunks than cache blocks, but these capabilities may not be necessary.

The C- and C++-based languages mentioned above are:

Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language

Implementing a parallel C++ runtime system for scalable parallel systems

Parallel programming in Split-C

These look like predecessors of CUDA. So, a lack of coherency perhaps makes sense for massively parallel workloads, for which the relatively slow synchronizations (due to the lack of coherency) would still account for only a tiny fraction of the overall runtime.

CRAY T3D and T3E indeed had a shared address space without hardware-supported coherency.