Cache Implementation in Pipelined Processor

Question

I have recently started coding in Verilog. I have completed my first project, prototyping a MIPS 32 processor with a 5-stage pipeline. My next task is to implement a single-level cache hierarchy on the instruction memory.

I have successfully implemented a 2-way set-associative cache. Previously I had declared the instruction memory as an array of registers, so whenever the IF stage needed the next instruction, the data (instruction) was assigned to the register immediately for further decoding (since a blocking/non-blocking assignment from any memory location completes without extra cycles).

But now that I have added a single-level cache on top of it, the cache FSM takes a few more cycles to do its work (searching for the data, and applying the replacement policy on a cache miss). The maximum delay is about 5 cycles on a cache miss.

Since each pipeline stage advances in a single cycle, whenever there is a cache miss the cache fails to deliver the instruction before the pipeline moves on to the next stage. So the output is always wrong.

To counteract this, I have made the cache clock 5 times faster than the processor pipeline clock. This does the job: since the cache clock is much faster, the cache does not need to worry about the processor clock.

But is this workaround legit? I haven't heard of multiple clocks in a processor system. How do real-world processors overcome this issue?

Yes, of course, there is another way: stalling the pipeline until the data is available in the cache (hit). But I am wondering whether making the memory system faster by increasing its clock is justified.

P.S. I am a newbie to computer architecture and Verilog, and I don't know much about VLSI. This is my first question ever: whatever questions come up, I usually find the answers readily on web pages, but I can't find much detail about this problem, so I am here.

I also asked my professor; she told me to research the topic further, because none of my colleagues or seniors have worked much on pipelined processors.

Peter Cordes · Accepted Answer · 2019-05-11T13:08:20.090

> But is this workaround legit??

No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.

A classic 5-stage RISC pipeline assumes, and is designed around, single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists.)

A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
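To make the "hold the PC and feed NOPs" idea concrete, here is a minimal cycle-level sketch in Python rather than Verilog. It is not the asker's design: `run_fetch`, `miss_pcs`, and `miss_penalty` are hypothetical names, and the cache is reduced to "these PCs miss once, then hit".

```python
NOP = "nop"

def run_fetch(program, miss_pcs, miss_penalty):
    """Return what the IF stage hands to ID, one entry per cycle.

    program      -- list of instruction strings, indexed by PC (hypothetical)
    miss_pcs     -- PCs whose first fetch misses in the cache (assumption)
    miss_penalty -- stall cycles needed before a missed instruction is ready
    """
    pc = 0
    wait = 0                 # stall cycles left for the fetch in progress
    pending = set(miss_pcs)  # PCs not yet brought into the cache
    issued = []
    while pc < len(program):
        if wait == 0 and pc in pending:
            pending.discard(pc)  # the refill starts this cycle
            wait = miss_penalty
        if wait > 0:
            issued.append(NOP)   # a bubble goes down the pipe; PC holds
            wait -= 1
        else:
            issued.append(program[pc])
            pc += 1              # PC only advances when an instruction issues
    return issued
```

With a 3-cycle penalty on the second instruction, `run_fetch(["lw", "add", "sw"], {1}, 3)` yields `["lw", "nop", "nop", "nop", "add", "sw"]`: the rest of the pipeline just sees three extra NOPs. In Verilog the same idea would be a stall signal from the cache FSM that gates the PC register's enable and selects a NOP into the IF/ID register.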

If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.

You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.

MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.

But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
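A stall in a later stage can be sketched the same way. This is a simplified Python model, assuming a whole-pipeline stall on a MEM-stage miss as guessed above; `step`, `mem_stall`, and the tuple layout are hypothetical names for illustration only.

```python
BUBBLE = "nop"

def step(pipe, next_instr, mem_stall):
    """Advance a 5-stage pipeline by one clock.

    pipe       -- tuple (IF, ID, EX, MEM, WB) of the instruction in each stage
    next_instr -- instruction the fetch unit would bring in this cycle
    mem_stall  -- True while the MEM stage is waiting on a cache miss
    Returns (new_pipe, consumed); consumed is False when fetch must retry.
    """
    if_, id_, ex, mem, wb = pipe
    if mem_stall:
        # MEM and every earlier stage hold their instruction;
        # only WB moves on, receiving a bubble in place of real work.
        return (if_, id_, ex, mem, BUBBLE), False
    # Normal cycle: everything shifts one stage toward WB.
    return (next_instr, if_, id_, ex, mem), True
```

The `consumed` flag plays the role of the back-pressure signal: while it is False, the earlier stages (including the PC) must not advance.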

Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so it could do hit-under-miss and so on, not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
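The "detect the dependency and stall" check for the simple case can be sketched as below. This is a minimal Python model, assuming loads always hit and that forwarding covers every other dependency, so only a load immediately followed by a consumer needs one bubble. `Instr`, `load_use_hazard`, and `insert_bubbles` are hypothetical names, not part of any real tool.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "lw" or "add"
    dest: Optional[int]     # destination register number, or None
    srcs: Tuple[int, ...]   # source register numbers

NOP = Instr("nop", None, ())

def load_use_hazard(in_ex: Instr, in_id: Instr) -> bool:
    """True if the instruction in ID reads the result of a load still in EX."""
    return (in_ex.op == "lw"
            and in_ex.dest is not None
            and in_ex.dest != 0          # $0 is hardwired to zero on MIPS
            and in_ex.dest in in_id.srcs)

def insert_bubbles(program):
    """Return the issue order with one NOP after each load-use pair."""
    out = []
    prev = NOP
    for instr in program:
        if load_use_hazard(prev, instr):
            out.append(NOP)  # hold instr in ID for a cycle; EX gets a bubble
        out.append(instr)
        prev = instr
    return out
```

For `lw $2`, then `add $3, $2, $4`, this puts exactly one NOP between the two, which is the same slot the MIPS I load-delay rule made the compiler fill.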

MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion for the acronym (which would still fit MIPS II, where the load delay slot was removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages", but apparently I made that up in my head; thanks @PaulClayton for catching that.

Thank you sir for a great explanation, will further add a separate module to detect miss and add stall cycles. — Avisekh Ghosh, May 10 '19 at 13:08
I was just wondering: in the real world, are cache/memory systems a bottleneck for current-generation high-speed processors? — Avisekh Ghosh, May 10 '19 at 13:11
@AvisekhGhosh: Yes, most programs bottleneck on memory bandwidth and/or latency, even with large amounts of die-area dedicated to per-core private L1d/i and L2 caches, and large shared L3 caches. Otherwise often on branch-misses or sometimes latency dependency chains. That's because modern CPU cores are so blazingly fast, like 4-wide superscalar with out-of-order execution. Even Skylake's large 224-entry reorder buffer isn't big enough to hide the latency of an L3 cache miss, especially in high-throughput code that runs 3 to 4 instructions per clock... — Peter Cordes, May 10 '19 at 13:15
@AvisekhGhosh: see http://www.lighterra.com/papers/modernmicroprocessors/, and [How can cache be that fast?](//electronics.stackexchange.com/a/329955) — Peter Cordes, May 10 '19 at 13:16
MIPS actually expands to "*Microprocessor* with(out) Interlocked Pipeline Stages" (the R4000 changed the pipeline so "without" became "with"). — , May 11 '19 at 12:56
@PaulA.Clayton: Thanks, apparently my memory distorted it. I was thinking that any stall mechanism that could stop the previous stages of the pipeline counted as interlocks, but apparently it specifically means data-hazard detection. — Peter Cordes, May 11 '19 at 13:05
