
I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).

[single-cycle MIPS datapath diagram]

My thoughts:

  • add extra ports to a unified memory (much like the register file's multiple ports, with one port dedicated to instruction fetch), so this isn't really a functional-unit reuse problem
  • consolidate, so that memory could be unified and capacity wouldn't go unused

His:

  • agreed with me on the last point
  • adding ports degrades performance roughly quadratically
  • separate memories allow more leeway in placement on the chip
  • single-ported memory is faster

Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.

user18348324
2 Answers


If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.

One point you didn't seem to raise is that separating the two memories (caches) allows simultaneous access, so an instruction can be read while data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to both single-cycle and pipelined processors, since in both designs instruction fetch (the IF stage in a pipeline) overlaps with memory operations (the MEM stage in a pipeline).
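To build intuition for that structural hazard, here is a toy cycle-count model (entirely hypothetical, not a real MIPS simulator): every instruction needs one instruction-memory access, and loads/stores need one data-memory access on top of that. A unified single-ported memory must serialize the two kinds of access, while split memories let a fetch overlap with a data access.

```python
# Toy model: count memory-limited cycles for a short instruction stream.
# Assumptions (hypothetical): one access per port per cycle, no cache misses.

def cycles_unified(instrs):
    # One port: every access (fetch or data) needs its own memory cycle,
    # so loads and stores each steal a cycle from instruction fetch.
    return sum(1 + (1 if op in ("lw", "sw") else 0) for op in instrs)

def cycles_split(instrs):
    # Two memories: a data access can happen in the same cycle as a fetch,
    # so the memory-limited cost is just one cycle per instruction.
    return len(instrs)

program = ["lw", "add", "sw", "lw", "beq", "add"]
print(cycles_unified(program))  # 9: 6 fetches + 3 data accesses, serialized
print(cycles_split(program))    # 6: the 3 data accesses hide under fetches
```

The gap grows with the fraction of loads and stores in the program, which is exactly why the MEM stage conflicting with IF would be so costly.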

Further, because the Instruction Memory is read-only, it needs less circuitry.  In the case of being caches, the IM has no dirty bits, no write-back path, etc.  The IM and DM can also have different associativity.

In the case of not being caches, it is not clear how the computer system loads the instruction memory; perhaps it is some fast ROM, or it is loaded by an external device from ROM into IM.  A number of embedded systems have Instruction Tightly Integrated Memory and/or Data Tightly Integrated Memory (ITIM/DTIM), which do not act as caches and are not necessarily backed by main memory, instead serving as the primary memories.

Erik Eidt
  • Does a single-cycle MIPS need multi-ported memory for this? Doesn't it have to have finished code-fetch in order to decode and run a load or store? And doesn't a load have to finish before write-back of the result can happen at the end of the cycle for that instruction? (Stores don't have to finish, so if they're slower a store could still be settling when you want to start code-fetch, unless you just raise the cycle time.) Anyway, couldn't you just access memory twice, like in the first and second half-cycles (rising / falling edge)? – Peter Cordes Feb 07 '23 at 00:00
  • Or does subdividing the clock make it a "multi-cycle"? In which case you'd only want to have simplistic logic for driving two separate memory ports, even though access to them doesn't actually overlap in time? [single-cycle MIPS timeing questions](https://stackoverflow.com/q/24501672) / [Is a "single cycle cpu" possible if asynchronous components are used?](https://stackoverflow.com/q/63693436) – Peter Cordes Feb 07 '23 at 00:02
  • @PeterCordes, I don't know of any actual implementation of the single cycle MIPS processor — it may be only pedagogical, but who knows? Someone may have one somewhere as part of coursework. I'm of the opinion that using both clock cycle edges puts it in multi-cycle territory, though I don't view that as necessarily bad, just no longer pure single cycle. – Erik Eidt Feb 07 '23 at 01:00
  • I'm wondering if you have to subdivide the clock more than just rising/falling edge to access the same memory twice within one cycle. – Peter Cordes Feb 07 '23 at 01:30
  • @PeterCordes what you're discussing is a transparent latch (which is used for the pipelined implementation), where a write is done in the first half of the clock and the data is read in the second half – user18348324 Feb 07 '23 at 02:07
  • @ErikEidt you raised the point of simultaneous access, but if there are multiple read ports, with multiple signals, why would this be an issue? The register file is implemented in a way where multiple values can be read from multiple ports. The memory could be similar, with one dedicated instruction port. – user18348324 Feb 07 '23 at 02:08
  • @user18348324: Correct, but multi-ported RAM is much more expensive. Split caches (or split RAM) gives you that for free, so you have to put "use standard parts" in the advantages side for split. – Peter Cordes Feb 07 '23 at 02:11
  • @PeterCordes as a slight expansion (I learned about this today), the way my professor explained it, it's done using buffers timed just right so that coherence is maintained (buffering the read to stall it until after the write completes, and timing the write to start at the rising edge, or whatever clock edge the system uses). – user18348324 Feb 07 '23 at 02:14

Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).


In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.

We call this "modified Harvard": the performance advantages of Harvard, allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code-cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a simple single-cycle design, of course!)

It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, and which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles).
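The staleness problem can be sketched with a toy model (hypothetical, just for intuition): the I-cache keeps its own copy of code bytes, so a store through the data side is invisible to instruction fetch until the I-cache is explicitly synchronized, which is exactly what a JIT must arrange after generating code.

```python
# Toy model of a modified Harvard machine: fetch reads a private I-cache
# copy, so stores to code addresses need an explicit sync to become visible.

class ModifiedHarvard:
    def __init__(self, memory):
        self.memory = memory            # unified backing memory
        self.icache = dict(memory)      # instruction cache: private copies

    def store(self, addr, value):
        self.memory[addr] = value       # data side writes to backing memory

    def fetch(self, addr):
        return self.icache[addr]        # fetch hits the (possibly stale) I-cache

    def sync(self):
        # Models an icache-invalidate / pipeline-flush sequence after codegen.
        self.icache = dict(self.memory)

cpu = ModifiedHarvard({0x100: "old_insn"})
cpu.store(0x100, "new_insn")            # JIT writes freshly generated code
print(cpu.fetch(0x100))                 # "old_insn" -- fetch still sees stale code
cpu.sync()
print(cpu.fetch(0x100))                 # "new_insn" -- visible after explicit sync
```

Real ISAs expose this as explicit cache-maintenance or synchronization instructions; on a strict von Neumann machine with no I-cache, the store alone would have been enough.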


The first pipelined CPUs had small caches or in the case of MIPS R2000 even off-chip caches with only the controllers on-chip. But yes, MIPS R2000 had split I and D cache. Because you don't want code-fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running 1 instruction per cycle when you don't have cache misses.

In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
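A back-of-the-envelope critical-path sketch (all latencies hypothetical) illustrates why: in a single-cycle design the fetch and the load/store are serialized within the one long cycle anyway, so reusing a single-ported memory twice per cycle costs no more cycle time than having two separate memories.

```python
# Hypothetical stage latencies in picoseconds (illustrative only).
mem     = 200   # one memory access
regfile = 50    # register read or write
alu     = 100

# Fetch -> reg read -> ALU -> MEM -> write-back; the two memory accesses
# are on the same serial critical path either way, since nothing overlaps.
single_port_cycle = mem + regfile + alu + mem + regfile  # one memory, used twice
split_cycle       = mem + regfile + alu + mem + regfile  # split I-mem / D-mem

print(single_port_cycle)                 # 600
print(single_port_cycle == split_cycle)  # True: no overlap to exploit
```

The split only starts paying off once fetch and MEM can overlap, i.e. in a pipelined design.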


L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.

Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.

Peter Cordes
  • So basically, it's less expensive to have two smaller, specialized caches. Could you give me some intuition on why multi-ported memory would be more expensive? – user18348324 Feb 07 '23 at 02:12
  • @user18348324: It's very rarely needed therefore the manufacturing volume isn't there. Also, for DRAM specifically, it's all about minimizing the number of transistors and wires per bit of storage. And reading a bit of DRAM requires refreshing it, so it doesn't naturally lend itself to multiple read ports. Apparently multi-ported DRAM is a thing for traditional video RAM (https://en.wikipedia.org/wiki/Dual-ported_RAM), but other than that you usually only find multi-ported SRAM. That still costs more area per bit, and more power per access, than two separate RAMs of similar total size. – Peter Cordes Feb 07 '23 at 02:17
  • @user18348324: [What type of memory allows for most parallel read/write operations per clock cycle in an FPGA?](https://electronics.stackexchange.com/q/166966) from 2015 mentions some real-world memory types. I didn't find much detail about exactly why multi-ported RAM (especially DRAM) is less space-efficient or power-efficient; maybe if I'd looked for details of how it's typically designed. – Peter Cordes Feb 07 '23 at 02:46