
I am reading about cache coherence (http://simple.wikipedia.org/wiki/Cache_coherence, "What's the point of cache coherency?"). It is stated that

Cache coherence problems appear for processors having multiple cache memories.

My question is: even if we have multiple caches in a single processor, the kernel will allocate only one cache line per page-table entry of a process. So why does the cache coherence problem arise, and what is its solution?

Embedded Programmer
  • It's not clear what you are asking. Can you rephrase your question? – cnicutar Sep 04 '13 at 07:44
  • Since the kernel will copy data and text into the cache for fast access and keep one copy of the data and text in cache memory, why do multiple copies (data+text) need to exist in different cache memories, creating a cache coherence problem? – Embedded Programmer Sep 04 '13 at 07:49

2 Answers


You have misunderstood the function of the cache and how it is controlled.

First of all a cache can be enabled or disabled or (if it has write cache functionality) flushed under direct program (usually OS) control. The program may also direct the cache to preload (read) certain areas of memory because the program has better knowledge of what data it needs next than the cache does.
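As a rough sketch of that last point, GCC and Clang expose such a preload hint through the __builtin_prefetch extension (the CPU is free to ignore the hint, and the prefetch distance of 16 elements below is only a guess that would need tuning on real hardware):

    #include <stddef.h>

    /* Sum an array while hinting the cache to preload data a few
       elements ahead.  __builtin_prefetch is a GCC/Clang extension;
       the hint may be dropped, and the distance of 16 is a guess. */
    long sum_with_prefetch(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
            s += a[i];
        }
        return s;
    }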

Apart from this the cache acts as a transparent, high-speed buffer or system of high-speed buffers between the processor core and RAM. If we assume a standard PC with DDRx RAM memory we have the problem that DDRx RAM is incapable of delivering data to the processor at anywhere near the rate that the processor can make use of it. In the same manner, the DDRx RAM cannot be written to at anywhere near the speed the processor is capable of writing to it so the cache buffers written data as well (how depends on the chosen cache write strategy design).

Typically, when a cache sees a processor (application) access to a RAM area, it will assume that the RAM immediately following it will also be accessed, and it preloads that into a cache line. When the application wants that data it is already in the cache and the program runs faster. If the program doesn't need it, the cache has loaded it unnecessarily, wasting time and memory-interface capacity, which may impact later cache work.
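A minimal C sketch of why that preloading pays off, assuming (as on many current CPUs) a 64-byte cache line; both functions touch the same elements, but the strided walk wastes most of each preloaded line:

    #include <stddef.h>

    #define N (1024u * 1024u)

    /* Sequential walk: one fetched 64-byte line serves 16 consecutive
       ints, so most accesses are cache hits. */
    long sum_sequential(const int *a)
    {
        long s = 0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Strided walk: jumping 16 ints (one assumed 64-byte line) per step
       touches a new line on nearly every access, so the neighbours the
       cache preloaded are thrown away unused. */
    long sum_strided(const int *a)
    {
        long s = 0;
        for (size_t start = 0; start < 16; start++)
            for (size_t i = start; i < N; i += 16)
                s += a[i];
        return s;
    }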

If the processor requires data that is not in the cache you get a cache miss. The processor stops dead in its tracks until the data has been brought in from RAM, which may mean that the processor does absolutely nothing for a number of CPU clock cycles; I'd assume the normal case is between 30 and 100 cycles.

Really fast applications - usually in the form of advanced computer games - would not be fast if they didn't make the utmost use of the cache in such ways as code organization (small, tight, fast), data locality (data is as small as possible and not spread all over the place) and preloading whenever possible. At a higher level you need good design and algorithms, but they too are more or less tied to the cache.
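To give one hedged illustration of what "data locality" means in code, here is the classic array-of-structs versus struct-of-arrays choice (the particle field names are made up for the example):

    #include <stddef.h>

    /* Array of structs: the velocity fields share cache lines with the
       positions, so a sweep over just the x coordinates drags the whole
       struct through the cache. */
    struct particle_aos { float x, y, z, vx, vy, vz; };

    float sum_x_aos(const struct particle_aos *p, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += p[i].x;
        return s;
    }

    /* Struct of arrays: the x coordinates are packed contiguously, so
       every byte of every fetched cache line is useful. */
    struct particles_soa { float *x, *y, *z, *vx, *vy, *vz; };

    float sum_x_soa(const struct particles_soa *p, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += p->x[i];
        return s;
    }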

As you are an embedded programmer the situation is a bit different. Most embedded processors have RAM in the form of on-chip SRAM without wait states. This means that reading from and writing to SRAM is accomplished as fast as the processor wants, i.e. it won't ever need to stop dead in its tracks because the SRAM keeps pace with it.

The processor also has on-chip FLASH memory which is much slower than the SRAM. To compensate for this the chip will have a read cache between the FLASH and the CPU so that most reads from FLASH (as it is read-only) will be performed without the processor having to wait for the data to arrive.

An embedded design may require more RAM than is available on the chip. In those situations external SDRAM or DDRx RAM chips are mounted on the card. Now you are back at the RAM situation I described for PCs where the external RAM cannot be accessed quickly enough. In addition external memory is usually accessed by a less than 32-bit wide data path which means that 32-bit or larger data entities will require two or more physical accesses before they arrive at the processor. In the meantime, the processor waits.

Back to your original question. An embedded processor's SRAM may be modified both by the processor and by peripherals (typically using DMA, which the processor can't detect). Because SRAM isn't buffered by a cache (due to its speed), its contents are always up to date. If - on the other hand - you have externally mounted RAM with wait states, then you need a synchronization function (called a BIU - Bus Interface Unit) to ensure that (processor and DMA) writes to it occur in a controlled manner. The BIU will perform all kinds of tricks to speed things up but, in the end, the BIU is not a cache and the processor will have to wait on it, slowing things down.
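As a sketch of what that means in practice, here is roughly how a driver might receive a block written by DMA; dma_start_rx, dma_rx_done and cache_invalidate_range are hypothetical placeholders for whatever the vendor's HAL actually provides:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical device-layer hooks; the real names and signatures
       depend on the vendor's HAL. */
    extern void dma_start_rx(volatile uint8_t *buf, size_t len);
    extern volatile int dma_rx_done;              /* set by the DMA ISR */
    extern void cache_invalidate_range(void *p, size_t len);

    #define BUF_LEN 256u
    static volatile uint8_t rx_buf[BUF_LEN];

    /* Receive a block written into RAM by a DMA peripheral.  On zero-
       wait-state on-chip SRAM the invalidate is unnecessary; on cached
       external RAM the CPU must discard any stale cached copy before
       reading what the DMA engine wrote. */
    void receive_block(void)
    {
        dma_rx_done = 0;
        dma_start_rx(rx_buf, BUF_LEN);

        while (!dma_rx_done)
            ;                                     /* wait for the ISR */

        cache_invalidate_range((void *)rx_buf, BUF_LEN);
        /* rx_buf is now safe to parse */
    }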

_____ Answer to the first comment _____

Cache coherence is a bit more complicated than that.

You should probably see cache coherence as something that has to do with maintaining a reasonably up-to-date copy of certain RAM areas in the caches. There are several ways in which a location in RAM may be updated. One is by any number of cores which, for example in a massively parallel application, read from common memory areas while modifying others in a memory space they all share but hopefully don't all update at the same time.

It is easy to forget that not only cores update RAM. When a hard disk controller is ordered to read data into RAM it does so with a great deal of autonomy. It lines up the head on the correct disk track and waits for the disk to reach that position after which it starts reading. The data arriving from the disk is sent to a location in RAM. After this has completed the controller interrupts the operating system to notify it of the completion.

Physically the controller resides in the "Southbridge" component (which controls all the peripherals) of the motherboard and sends data it reads from the drive to the "Northbridge" component which interfaces to the CPU(s), graphics controller and RAM. This description illustrates a design which applies to many processors but not all (AMD's Opteron being one exception).

So a core needs to be notified of any changes to RAM data at addresses which its own cache may have fetched to speed up the core's execution. The Northbridge is told by the controller where to write the data. As it does this it also notifies the (typically) L3 cache of where changes are occurring. L3 compares this and determines if any of its cache lines are affected. L3 also informs L2, which checks its lines and informs L1, which checks its lines. If a line or lines are affected the corresponding Lx cache marks the line(s) as invalid, freeing them up.
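A toy software model of that comparison, for a hypothetical direct-mapped cache with 64-byte lines (real L3s are set-associative and do this in hardware, so this only shows the index/tag check conceptually):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE   64u
    #define NUM_LINES   1024u

    struct cache_line {
        bool     valid;
        uint64_t tag;           /* address bits above index and offset */
    };

    static struct cache_line lines[NUM_LINES];

    /* Called when the cache is told that RAM at 'addr' was updated: if
       the matching line currently holds that address, mark it invalid so
       the next read fetches fresh data. */
    void snoop_invalidate(uint64_t addr)
    {
        uint64_t index = (addr / LINE_SIZE) % NUM_LINES;
        uint64_t tag   = addr / (LINE_SIZE * NUM_LINES);

        if (lines[index].valid && lines[index].tag == tag)
            lines[index].valid = false;
    }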

A multi-core processor will typically have a single generic L3 interfacing to the Northbridge and to the core-specific L2 caches. An L3 will send information about any updates to all of the connected L2s since only they know what they contain.

In a multi-core, multi-processor system the Northbridge will inform ALL L3s of a memory update. If one of the cores updates a RAM location its L3 will inform the L2s of the core's on-chip siblings. The Northbridge detects the update and informs the L3s of the other installed processors.

If the data in the newly invalidated cache lines is used often, the caches will scurry to reload a fresh copy, clashing with one another: not L1 against L2, but L2s contending for L3 and L3s contending for RAM.

As you can understand the coherency work performed by the Northbridge and caches is significant, complex and time-consuming. Because it is complex and because of the hierarchical nature of the components involved there is a latency between when a RAM update occurs and when it has propagated into the affected components (caches).

What this means is that there is a limit to the cache coherency that can be achieved: what happens if a CPU fetches data from a cache line that will be invalidated a few cycles later? It turns out that cache coherency is a balance between acceptable coherency and total coherency. Why not total coherency? Total coherency would mean that the caches would have to stop the cores executing while every update propagates, and in the end you would defeat the purpose of placing the cache system there in the first place: to minimize the cores being forced to wait for data from RAM.

I use the "training wheels" analogy: if you have training wheels (total cache coherency) on your bike you probably won't fall but you can't travel very fast because you can hardly steer. Take off the training wheels and you can go as fast as you like and avoid dangers because you can steer. On the other hand the results of wipeouts are much more drastic.

It is up to the programmer to handle the last little piece of synchronization. A program will (usually) not let a core read a memory location which is being updated by a block of data read from a disk. At any time a core may need to write to a shared memory location which will affect all other cores. On the x86 this is prepared for by asserting the bus lock signal using (typically) a form of the "xchg reg,mem" instruction. The signal tells the system that everyone must finish up what they are doing because it needs a known state. When the xchg instruction has completed and the result is successful (i.e. another bus lock wasn't in progress) the data is written and the bus lock released. I've written about it here and here.
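For concreteness, a minimal sketch of the kind of lock that is built on that exchange, written with C11 atomics (on x86 the compiler typically emits a locked xchg for the atomic exchange); this is illustrative only, with no back-off or fairness:

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    /* Keep exchanging 1 into the flag until we read back 0, i.e. the
       lock was free and we now own it. */
    static void spin_lock(spinlock_t *l)
    {
        while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            ;
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }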

The bus lock is no trivial thing. Regardless of whether it succeeds or not, a bus lock requires an enormous number of CPU cycles to attempt: anywhere from 300 up to perhaps 3000. This is the price you pay for not having total cache coherency: if you as a programmer come up with an efficient software synchronization scheme it will barely be noticed because you use it so seldom. The inexperienced programmer will play it safe and use it all the time, and the resulting system will be slow. With experience he or she will learn that it is possible to "play it safe" in more or less intelligent ways.

The reason why the cores have L1 and L2 caches of their own is that they may be working on different data or in different programs. If they are working on the same thing they will clash when they attempt to read from the common cache. The L3 is the common cache for cores and that is where they clash. Before they get that far they will have (hopefully) been able to do a lot of useful, undisturbed and uninterrupted work in L1 and L2. I say "hopefully" because that depends on the programmer.
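One common way the cores end up clashing anyway is "false sharing": two cores' private data landing on the same cache line. A hedged sketch, assuming 64-byte lines:

    #include <stdatomic.h>

    /* Two per-thread counters packed next to each other share one cache
       line, so each increment by one core invalidates the other core's
       L1/L2 copy (false sharing). */
    struct counters_shared {
        atomic_long a;          /* updated by core/thread 0 */
        atomic_long b;          /* updated by core/thread 1 */
    };

    /* Aligning each counter to its own (assumed 64-byte) cache line lets
       each core keep its line exclusively and work undisturbed. */
    struct counters_padded {
        _Alignas(64) atomic_long a;
        _Alignas(64) atomic_long b;
    };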

Olof Forshell
  • That is only about how the cache works and its benefits. I read that cache coherence issues occur only with multiple cache lines (which is possible on single or multiple processors that have separate cache memories). But I am confused: why does the same process (text+data) reside in multiple cache memories? – Embedded Programmer Sep 04 '13 at 10:15

You may have multiple threads and/or interrupt handlers inside a process. The CPU may then hold values of a single memory address in separate caches. External modules and drivers may also share that memory resource with their own cached copies. In this context a cache coherency problem arises.
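A minimal sketch of that situation in C11, assuming one thread (or ISR) produces data that another waits for; making the flag atomic is what tells the compiler and hardware to keep the cached copies coherent and visible:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool data_ready;   /* shared between the two contexts */

    void producer(void)              /* e.g. called from an ISR or thread */
    {
        /* ... fill the shared buffer ... */
        atomic_store_explicit(&data_ready, true, memory_order_release);
    }

    void consumer(void)
    {
        while (!atomic_load_explicit(&data_ready, memory_order_acquire))
            ;                        /* spin until the update is visible */
        /* ... consume the shared buffer ... */
    }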

Alex Sed