28

Can someone please explain what we gain by having separate instruction and data caches? Any pointers to a good link explaining this will also be appreciated.

ango

  • One is for data and one is for instructions: both may 'churn' at different rates, and have different access patterns. – Mitch Wheat Jan 03 '12 at 01:40
  • [From Wikipedia](http://en.wikipedia.org/wiki/CPU_cache): _"Instruction and data caches can be separated for higher performance with Harvard CPUs but they can also be combined to reduce the hardware overhead."_ So they're not **always** distinct. – Matt Ball Jan 03 '12 at 01:43
  • A fun tidbit here is that JIT can create issues by writing instructions out through the data cache, and either they aren't in memory/lower shared cache when it is time to retrieve the instructions, or the instruction cache may have a stale instruction. You have to manually take care of the synchronization. – rsaxvc Jan 03 '12 at 01:54
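A minimal sketch of the synchronization rsaxvc's comment describes, assuming a POSIX system and the GCC/Clang builtin `__builtin___clear_cache`. On x86 the instruction cache is kept coherent by hardware, but on ARM and many other ISAs the explicit sync is mandatory; the machine-code bytes below are x86-64 and purely illustrative:

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* x86-64 encoding of: mov eax, 42 ; ret */
    uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* Writable + executable page; a hardened OS may refuse W+X,
       in which case a real JIT remaps W -> X instead. */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);   /* stores go through the D-side */

    /* Make the freshly written bytes visible to instruction fetch.
       Effectively a no-op on x86; required on e.g. AArch64. */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());             /* prints 42 */
    return 0;
}
```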

5 Answers

31

The main reason is performance. Another reason is power consumption.

Separate dCache and iCache make it possible to fetch instructions and data in parallel.

Instructions and data have different access patterns.

Writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. For example, the AMD Software Optimization Guide for 10h and 12h Processors states:

Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and stored alongside the instruction cache.

The Intel Nehalem CPU features a loop buffer (the Loop Stream Detector), and in addition the Sandy Bridge CPU features a µop cache (see Agner Fog's _The microarchitecture of Intel, AMD and VIA CPUs_). Note that these are features related to code and have no direct counterpart in relation to data. They benefit performance, and since Intel "prohibits" CPU designers from introducing features which would result in an excessive increase in power consumption, they presumably also benefit total power consumption.

Most CPUs feature a data forwarding network (store-to-load forwarding). There is no "store-to-load forwarding" for code, simply because code is modified much less frequently than data.

Code exhibits different patterns than data.

That said, most CPUs nowadays have a unified L2 cache which holds both code and data. The reason for this is that having separate L2I and L2D caches would pointlessly consume the transistor budget while failing to deliver any measurable performance gains.

(Surely, the reason for having a separate iCache and dCache isn't reduced complexity, because if the reason were reduced complexity then there wouldn't be any pipelining in any of the current CPU designs. A CPU with pipelining is more complex than a CPU without pipelining. We want the increased complexity. The fact is: the next CPU design is (usually) more complex than the previous design.)
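As a quick way to observe the split-L1 / unified-L2 arrangement described above, here is a minimal sketch (assuming a Linux system) that walks the cache hierarchy the kernel exposes under sysfs:

```c
#include <stdio.h>

int main(void) {
    char path[128], level[16], type[16];
    for (int i = 0; ; i++) {
        FILE *f;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", i);
        if (!(f = fopen(path, "r"))) break;   /* no more cache levels */
        fscanf(f, "%15s", level);
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", i);
        if (!(f = fopen(path, "r"))) break;
        fscanf(f, "%15s", type);
        fclose(f);

        printf("L%s %s\n", level, type);
    }
    return 0;
}
```

On a typical x86-64 machine this prints something like `L1 Data`, `L1 Instruction`, `L2 Unified`, `L3 Unified`, matching the split-L1 / unified-L2 design.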

  • I meant complexity of the cache controller. – rsaxvc Jan 07 '12 at 05:25
  • Writes to I-cache aren't just rare, they're literally impossible in most CPU designs; it can be built as read-only, which doesn't need extra space in the tags to track whether data is "dirty" or not. And the ECC granularity can be as large as you want. (Of course data has to enter and leave by eviction on cache miss, and fetch from outer cache, so it does still need a "write port" for that) – Peter Cordes Oct 02 '20 at 06:34
  • Store-forwarding is done from the store buffer, not L1d cache. It would work the same whether L1 was split or unified. Also, "data forwarding network" normally refers to bypass forwarding from execution unit to execution unit (instead of waiting for write-back + register-read). The top of this answer is correct, though: the key reason is read and write ports: two smaller caches used in parallel are much cheaper to build than one larger cache with the sum total of read and write ports. – Peter Cordes Oct 02 '20 at 06:36
6

It has to do with which functional units of the CPU primarily access that cache. Since the ALU and FPU access the data cache while the decoder and scheduler access the instruction cache, and pipelining often allows the instruction fetch stages and the execution units to work simultaneously, using a single cache would cause contention between these two components. By separating them we lose some flexibility but gain the ability for these two major components of the processor to fetch data from cache simultaneously.

Dan
2

As the processor's MEM and FETCH stages can access the L1 cache (assuming a combined one) simultaneously, there can be a conflict over which access to prioritize (and this can become a performance bottleneck). One way to resolve this is to build the L1 cache with two read ports, but increasing the number of ports increases the cache area roughly quadratically (going from one port to two roughly quadruples the area) and hence increases power consumption.

Also, if the L1 cache is combined, there is a chance that data blocks will evict blocks containing instructions that are important and about to be accessed. These evictions and the cache misses that follow can hurt overall performance.

Also, most of the time the processor fetches instructions sequentially (with a few exceptions such as taken branch targets and jumps), which gives the instruction cache good spatial locality and hence a good hit rate. And, as mentioned in other answers, there are hardly any writes to the iCache (the exceptions being self-modifying code and JIT compilers). So separate iCache and dCache designs can each be optimized for their own access patterns and for the surrounding components such as load/store queues and write buffers.
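As a rough way to see the split sizes this answer's reasoning is based on, glibc exposes per-level cache sizes through `sysconf` (note these `_SC_LEVEL*` names are glibc extensions, not POSIX, and may report 0 when the information is unavailable):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Each call returns the cache size in bytes, or 0 if unknown. */
    printf("L1i: %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    return 0;
}
```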

user1669844
  • Self-modifying code doesn't directly write to I-cache, it has to invalidate it. (manually with a special instruction, on most non-x86 ISAs where I-cache isn't coherent). L1I-cache is normally read-only, with tags not needing space for a dirty bit. (And not needing to support byte accesses.) And the only I-cache write port can be connected to fetch from L2, without needing to mux it with writes from the CPU core. See also [What does a 'Split' cache means. And how is it useful(if it is)?](https://stackoverflow.com/q/55752699) – Peter Cordes Oct 02 '20 at 06:43
2

One reason is reduced complexity: you can implement a shared cache that retrieves multiple lines at once, or retrieves them asynchronously (see Hit-Under-Miss), but doing so makes the cache controller far more complicated.

Another reason is execution stability: if you have a known amount of iCache and dCache, the caching of data cannot starve the cache system of instructions, as could occur with a simplistic shared cache.

And as Dan stated, having them separated makes pipelining easier, without adding to the controller complexity.

rsaxvc
-1

There are generally two kinds of architectures: 1. the von Neumann architecture and 2. the Harvard architecture. The Harvard architecture uses two separate memories. You can find more on this ARM page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3839.html