38

From here:

Instructions and data have different access patterns, and access different regions of memory. Thus, having the same cache for both instructions and data may not always work out.

Thus, it's rather common to have two caches: an instruction cache that only stores instructions, and a data cache that only stores data.

The distinction between instructions and data seems intuitive, but now I'm not so sure what the difference is in this context. What counts as data and gets put into the data cache, and what counts as an instruction and gets put into the instruction cache?

I know ARM assembly. Would anything requiring STR, LDR, LDMFD or STMFD use the data cache? But technically speaking, STR, LDR, LDMFD and STMFD are all instructions, which is why I'm confused. Must "data" always exist with an "instruction"? Is data considered anything in the .data section?

For example, with `LDR R1, =myVar`, would the LDR instruction go into the instruction cache and the contents of myVar go into the data cache? Or does it not work like that?

"Instructions and data have different access patterns" - could someone please elaborate?

This comment I made on a helpful post highlights my difficulty understanding:

"The idea is that if an instruction has been loaded from memory, it's likely to be used again soon" but the only way to know the next instruction is to read it. That means a memory read (you can't say it's already in cache because a new instruction is being red). So I still don't see the point? Say a LDR instruction just happened, so now LDR is in the data cache. Maybe another LDR instruction will happen, maybe it won't, we can't be sure so we have to actually read the next instruction - thus defeating the purpose of cache.

artless noise
Celeritas

4 Answers

24

Instruction fetches can be done in chunks, with the assumption that much of the time you are going to run through many instructions in a row. So instruction fetches can be more efficient: there is likely a handful or more clocks of overhead per transaction, plus the delay for the memory to have the data ready, then a clock per bus width for the size of the transaction. A burst of 8 words or instructions might be, say, 5+n+8 clocks; that is more efficient than fetching one instruction at a time, (5+n+1)*8.
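As a rough sketch of that arithmetic (the overhead, memory delay and word count below are made-up placeholder numbers, not figures for any particular ARM core):

#include <stdio.h>

/* Illustrative-only numbers: per-transaction overhead, memory-ready delay,
   and one clock per bus-width beat. Real values depend on the core, the
   bus and the memory controller. */
#define OVERHEAD  5
#define MEM_DELAY 20   /* "n": large for DRAM, small for an L1 hit */
#define WORDS     8

int main(void)
{
    int burst  = OVERHEAD + MEM_DELAY + WORDS;        /* one 8-word burst: 5+n+8 */
    int single = (OVERHEAD + MEM_DELAY + 1) * WORDS;  /* eight 1-word reads: (5+n+1)*8 */

    printf("burst fetch   : %d clocks\n", burst);
    printf("single fetches: %d clocks\n", single);
    return 0;
}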

For data, on the other hand, it is not as good an assumption that it will be read sequentially much of the time, so extra cycles can hurt; only fetch the data asked for (up to the width of the memory or bus, as that is a freebie).

On the ARMs I know about, the L1 I and D caches are separate and the L2 is combined. L1 is not on the AXI/AMBA bus and is likely a more efficient access than the L2 and beyond, which are on AMBA/AXI (a few cycles of overhead, plus the memory delay, plus one clock per bus width of data for every transaction).

For address ranges that are marked as cacheable (if the MMU is on), the L1, and as a result the L2, will fetch a whole cache line instead of the individual item for data, and perhaps more than one cache line's worth of instructions for an instruction fetch.

Each of your ldr and ldm instructions is going to result in data cycles that, if the address is cacheable, can go into the L2 and L1 caches if not already there. The instruction itself, if at a cacheable address, will also go into the L2 and L1 caches if not already there. (Yes, there are lots of knobs to control what is cacheable and not; don't want to get into those nuances, just assume for the sake of the discussion that all of these instruction fetches and data accesses are cacheable.)

You would want to save the instructions just executed in the cache in case you have a loop or run that code again. The instructions that follow in the same cache line will also benefit from the saved overhead of the more efficient access. But if you only execute through a small percentage of the cache line, then overall those cycles are a waste, and if that happens too much the cache made things slower.
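For example, a trivial loop (hypothetical C, not from the question) shows both effects:

/* Hypothetical example. The few instructions in this loop body fit in one
   or two I-cache lines, so after the first pass every instruction fetch is
   an L1 I-cache hit. The array streams through the D-cache, pulling in one
   new line per 16 ints (assuming a 64-byte line). */
int sum_array(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];   /* one data load per iteration */
    return sum;
}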

Once something is in a cache, then the next time it is read (or written, depending on the settings) the cache copy is the one that is used, not the copy in slow memory. Eventually (depending on settings), if the cache copy of some item has been modified by a write (str, stm) and some new access needs space in the cache, then an old line is evicted back to slow memory, that is, a write from the cache to slow memory happens. You don't have this problem with instructions: instructions are basically read-only, so you don't have to write them back to slow memory; in theory the cache copy and the slow memory copy are the same.
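A toy sketch of that eviction difference (the 64-byte line size and the names here are assumptions for illustration, not a real controller):

#include <string.h>

/* Toy model of one write-back cache line. Data lines can become dirty via
   str/stm and must be written back to slow memory before being replaced;
   instruction lines are never written, so they can simply be dropped. */
struct cache_line {
    unsigned tag;              /* which 64-byte block of memory this holds */
    int dirty;                 /* set by a store hit, never set for I lines */
    unsigned char bytes[64];
};

void evict(struct cache_line *line, unsigned char *slow_mem)
{
    if (line->dirty) {
        /* modified data line: write back before reuse */
        memcpy(slow_mem + (size_t)line->tag * 64, line->bytes, 64);
        line->dirty = 0;
    }
    /* clean data lines and instruction lines need no write-back; the
       slow-memory copy is already identical */
}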

ldr r1,=myvar

will result in a pc relative load

ldr r1,something
...
something: .word myvar

The ldr instruction will be part of a cache line fetch, an instruction fetch (along with a bunch more instructions). These will be saved in the I part of the L1 cache on an ARM and in the shared L2 (if enabled, etc.). When that instruction is finally executed, the address of something will experience a data read, and if caching is enabled in that area for that read, it will also go into the L2 and the L1 cache (D part) if not already there. If you loop around and run that instruction again right away, then ideally the instruction will be in the L1 cache and the access time to fetch it is very fast, a handful of clocks total. The data will also be in the L1 cache and will also take only a handful of clocks to read.
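In C terms, a made-up loop around that kind of load could look like this (myVar and the function name are hypothetical, not from the question):

/* Hypothetical. The first pass pulls the load instruction into the I side
   of L1 via an instruction fetch, while the literal-pool word holding the
   address and myVar itself come in via data reads into the D side. On
   later passes both sides hit in L1, a handful of clocks each. */
volatile int myVar;

int spin_on_myvar(int iterations)
{
    int last = 0;
    for (int i = 0; i < iterations; i++)
        last = myVar;   /* data read: D side; the ldr encoding itself: I side */
    return last;
}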

In the 5+n+8 I mentioned above, the 5 is some number of clocks of overhead (just a possibility; it can vary both by the design and by what else is going on in parallel). The n depends on the slower memory's speed. That n is quite large for DRAM, so the L2 and L1 caches are much, much faster, and that is why the cache is there at all: to reduce the large number of clock cycles for every DRAM access, efficient or not.

old_timer
  • 1
    If you look at the TRM for your arm processor (technical reference manual) and get the AMBA/AXI spec from arm, then you can see how many steps are involved in each type of access. see the axi bus control lines that indicate what kind of access, permissions, cacheable or not, etc. – old_timer Mar 14 '14 at 03:51
  • 1
    "instruction fetches can be done in chunks" is this another way of saying a cache line is typically larger than a single instruction? – Celeritas Mar 17 '14 at 03:45
  • Yes, cache lines are larger than a single instruction, what would be the point otherwise? – old_timer Mar 17 '14 at 03:53
11

The instruction cache would include cache lines fetched from memory for execution. The data cache would include cache lines fetched from memory for loading into a register as data.

Warren Dew
  • 2
    Could you give an example? For instance `LDR R1, =myVar` then would LDR go into instruction cache and the contents of myVar go into data cache? – Celeritas Mar 14 '14 at 02:10
  • Yes, that's what would happen. – Warren Dew Mar 14 '14 at 02:28
  • What's the point of caching the `LDR` instruction? Is it somehow faster to convert the opcode that way? – Celeritas Mar 14 '14 at 03:24
  • Any read from memory takes many times longer than a read from cache. The idea is that if an instruction has been loaded from memory, it's likely to be used again soon, so keeping it in cache will make use of it faster next time around, since we won't have to wait for a memory load. – Warren Dew Mar 14 '14 at 03:42
  • "The idea is that if an instruction has been loaded from memory, it's likely to be used again soon" but the only way to know the next instruction is to read it. That means a memory read (you can't say it's already in cache because a new instruction is being red). So I still don't see the point? Say a LDR instruction just happened, so now LDR is in the data cache. Maybe another LDR instruction will happen, maybe it won't, we can't be sure so we have to actually read the next instruction - thus defeating the purpose of cache. – Celeritas Mar 14 '14 at 09:19
  • 2
    @Celeritas The instruction cache does not cache opcodes but specific instructions. Caching instructions is particularly helpful in loops (where the same code is executed repeatedly), but even without loops, calling a function from different places allows reuse of code. Caches also facilitate prefetching and exploiting spatial locality. –  Mar 14 '14 at 15:30
  • @PaulA.Clayton so you're saying that the `fetch` part of fetch/decode/execute doesn't need to happen when the instruction is in cache and it is known to be the next instruction because it's in a loop? I mean after an instruction is executed, the program counter is incremented, then the next instruction is fetched. So you're saying there are scenarios where this fetch takes place in the cache? – Celeritas Mar 14 '14 at 20:02
  • 2
    @Celeritas What I am saying is that with an Icache, instead of always having fetch [wait for memory]/decode/execute the processor more often has fetch [cache hit]/decode/execute whether because the code was recently executed (loop or recent function call) or because the successive instructions were "prefetched" because the cache is filled in chunks (x86 implementations often use 64B cache blocks). –  Mar 14 '14 at 21:11
  • I think I get it now, I didn't know the cache was filled in chunks, I thought just one instruction at a time. So the idea is [the control bus](http://en.wikipedia.org/wiki/Control_bus) has a capacity such that more than one instruction can be carried to the cache at a time? – Celeritas Mar 15 '14 at 23:04
  • @Celeritas It's actually the memory bus rather than the control bus, but basically yes as far as cache lines are concerned. Prefetch is slightly different - the memory controller may anticipate that when one chunk is fetched from memory to the instruction cache, the next chunk will be needed too, and the controller will go ahead and fetch it next, even before it is requested. – Warren Dew Mar 16 '14 at 01:45
  • Ok gotcha, so the memory bus is what instructions are sent over? – Celeritas Mar 16 '14 at 01:58
  • @Celeritas all data that is fetched from memory goes over the memory bus, including the instructions in programs. – Warren Dew Mar 16 '14 at 02:44
  • @WarrenDew ok thanks. This ties back to my other question about ambiguity in terminology with data and instructions. You see, you just referred to instructions as data? – Celeritas Mar 16 '14 at 02:47
  • 1
    @Celeritas I see what you are saying, though I actually just said the instructions are fetched from memory. Perhaps the answer to your other question is this: both instructions and data are stored in memory; on processors with separate instruction and data caches, instructions are fetched from memory into the instruction cache, while data is fetched from memory into the data cache. – Warren Dew Mar 16 '14 at 04:20
5

The instruction cache is just another level of memory, there to access instructions faster. It isn't part of the CPU's clockwork / internal parts / fetch-decode-execute logic - however you want to name it.

When one says an instruction is cached, it means it is very close to the CPU memory-wise, so when the CPU needs to load the instruction at address X, it does that very fast compared to some other, uncached address Y.

CPUs don't cache instructions internally.

"Instructions and data have different access patterns" - could someone please elaborate?

For example, it is frowned upon to update (overwrite) instructions, and it is not common. So if you design a cache for instructions, you can optimize it for reads. Also, instructions are largely sequential, so if the CPU accesses the instruction at N, it is likely to access the instruction at N+1 too. However, those two properties can't be assumed for data, so data caches have to handle more cases.
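A contrived sketch of that contrast (the function and sizes are made up for illustration):

/* Contrived example. The instruction stream of this function is read-only
   and mostly sequential: fetch N, then N+1, ... so an instruction cache can
   be optimized for streaming reads. The data stream is neither: each
   iteration writes to an essentially arbitrary slot of table, so the data
   cache must handle writes, write-back and poor spatial locality. */
void scatter_count(int *table, unsigned table_len, const unsigned *keys, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
        table[keys[i] % table_len] += 1;
}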

auselen
-1

Consider, e.g., an ARM. For simplicity let's assume:

  • VA->PA is 1:1 mapped
  • MMU properties are set to cache-able and share-able
  • there is only L1I and L1D and L2D cache

When the core executes, the PC holds a VA (which, with the 1:1 mapping, equals the PA, so it is referred to as the PA below). At this point nobody knows whether the PA holds data or an instruction. Also, since this is the first time this address is touched, there won't be any cache allocation yet, so the hardware will look in the L1 I cache, the L1D and the L2D and find nothing.

It goes through the MMU table walk (the MMU cannot find the address in the TLB either) and finally fetches the content from main memory. Now we have the content at that PA, which can be data or an instruction.

allocation in I cache:

Anything fetched using the address in the PC is considered an instruction and is automatically allocated in the I cache; the decode unit is not involved yet. If the decode unit later finds out that it is an invalid instruction, it will abort anyway, and the abort/exception logic will evict/invalidate that entry from the I cache. Also, the prefetch engine can fetch the next few instructions relative to the PA used above.

allocation in D cache:

Once the decode unit figures out it is a load or store and passes control to the load/store unit, the data is allocated in / retrieved from the L1D cache.

So the next time an address comes into the PC and follows the same chain, it is checked against the contents of the L1I for instructions (and the L1D for data), and wherever there is a hit the content is fetched from the cache instead of from memory.