
The newer ARM Architecture Reference Manuals no longer give instruction timings. (Timings were given, at least for the early ARM2 and ARM3 chips.)

I know that cache misses result in external memory accesses that are very slow compared with, say, data-processing instructions like ADD x0, x1, x2 or BIC x0, x1, x2.

But how fast is a L1 cache hit?

If the answer is "it depends ..." what would be a rough guess (ballpark) figure?

Cache enabled (obviously), and a "flat" memory mapping (i.e. virtual address = physical address).

I suppose the answer also depends on the precise hardware being used, and that one should simply write test cases and measure the specific timings one is interested in...

I'm interested in the ARMv8 Raspberry Pi models -- which I don't possess. (I'm using QEMU).

I'd also be interested in any other timings, say, relative to:

ADD x0, xzr, xzr         ; == 1

FADD d0, d1, d2          ; floating-point

LDR x0, [x2]             ; L1 cache hit
LDR x0, [x2]             ; L1 cache miss, L2 cache hit
LDR x0, [x2]             ; L1 cache miss, L2 cache miss

LDP x0, x1, [x2]         ; L1 cache hit
LDP x0, x1, [x2]         ; L1 cache miss, L2 cache hit
LDP x0, x1, [x2]         ; L1 cache miss, L2 cache miss

Basically, what I really want to know is "when is it faster to load a value from memory rather than compute it?" (on a Raspberry Pi 4B).

There's the page "Approximate cost to access various caches and main memory?", but that refers to Intel chips.

colinh

1 Answer


I found https://developer.arm.com/documentation/uan0016/a/ (the Cortex-A72 Software Optimization Guide), from which it appears that an LDR that hits in the L1 cache has a latency of 4 cycles and a throughput of 1 per cycle, while a basic ALU op has a latency of 1 cycle and a throughput of 2 per cycle.

colinh