The newer ARM Architecture Reference Manuals no longer give instruction timings. (Timings were documented, at least for the early ARM2 and ARM3 chips.)
I know that cache misses result in external memory accesses that are very slow compared with, say, data-processing instructions like ADD x0, x1, x2 or BIC x0, x1, x2.
But how fast is an L1 cache hit?
If the answer is "it depends ..." what would be a rough guess (ballpark) figure?
Assume caches are enabled (obviously) and a "flat" memory mapping (i.e. virtual address = physical address).
I suppose the answer also depends on the precise hardware being used. And that one should simply write test cases and measure the specific timings one's interested in...
I'm interested in the ARMv8 Raspberry Pi models -- which I don't possess. (I'm using QEMU).
I'd also be interested in any other timings, say, relative to:
ADD x0, xzr, xzr ; == 1
FADD d0, d1, d2 ; floating-point add
LDR x0, [x2] ; L1 cache hit
LDR x0, [x2] ; L1 cache miss, L2 cache hit
LDR x0, [x2] ; L1 cache miss, L2 cache miss
LDP x0, x1, [x2] ; L1 cache hit
LDP x0, x1, [x2] ; L1 cache miss, L2 cache hit
LDP x0, x1, [x2] ; L1 cache miss, L2 cache miss
Basically, what I really want to know is: when is it faster to load a value from memory than to compute it? (on a Raspberry Pi 4B)
There's the page Approximate cost to access various caches and main memory? but that refers to Intel chips.