This may seem like a weird question.
Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post says that is the case for the Intel Core i7).
There are two objects A and B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache-line boundary, that is, its address is an integer multiple of 64.
1) If N < 64, when A is fetched by the CPU, B will be read into the cache too. So if B is needed, and the cache line has not been evicted yet, the CPU fetches B in a very short time. Everybody is happy.
2) If N >> 64 (i.e. much larger than 64), when A is fetched by the CPU, B is not read into the cache along with A. This is why we say "the CPU doesn't like to chase pointers around", and it is one of the reasons to avoid heap-allocated, node-based data structures like std::list.
My question is: if N > 64 but is still small, say N = 70 (in other words, A and B do not fit in one cache line but are not too far apart), when A is loaded by the CPU, does fetching B take the same number of clock cycles as it would when N is much larger than 64?
Rephrased: when A is loaded, let t represent the time it takes to fetch B. Is t(N=70) much smaller than, or almost equal to, t(N=9999999)?
I ask this question because I suspect t(N=70) is much smaller than t(N=9999999), since the CPU cache is hierarchical.
It would be even better if there is quantitative research on this.