If you need to read any bytes from a cache line, the core has to grab the whole cache line in (at least) MESI Shared state. On Haswell and later, the data path between L2 and L1d cache is 64 bytes wide (https://www.realworldtech.com/haswell-cpu/5/), so literally the whole line arrives at the same time, in the same clock cycle. There's no benefit to only reading the low 2 bytes vs. the high and low byte, or byte 0 and byte 32.
The same thing is essentially true on earlier CPUs; lines are still sent as a whole, and arrive in a burst of maybe 2 to 8 clock cycles. (AMD multi-socket K10 could even create tearing across 8-byte boundaries when sending a line between cores on different sockets over HyperTransport, presumably because it allowed cache reads and/or writes to happen between the cycles of sending or receiving a line.)
(Letting a load start as soon as the byte it needs arrives is called "early restart" in CPU-architecture terminology. An associated trick is "critical word first", where reads from DRAM start the burst with the word needed by the demand load that triggered it. Neither of these is a big factor in modern x86 CPUs with data paths as wide as a cache line, or close to it, where a line might arrive in 2 cycles. It's probably not worth sending a word-within-line as part of requests for cache lines, even in cases where the request didn't just come from HW prefetch.)
Multiple cache-miss loads to the same line don't take up extra memory-parallelism resources. Even in-order CPUs usually don't stall until something tries to use a load result that isn't ready. Out-of-order execution can definitely keep going and get other work done while waiting for an incoming cache line. On Intel CPUs, for example, an L1d miss for a line allocates a Line-Fill buffer (LFB) to wait for the incoming line from L2. But further loads to the same cache line that execute before the line arrives just point their load-buffer entry to the LFB that's already allocated to wait for that line, so it doesn't reduce your ability to have multiple outstanding cache misses (miss under miss) to other lines later.
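As a rough sketch of that point (function names are mine, assuming a cold buffer and 64-byte lines): both loops below take one demand miss per line; the extra loads in the second loop target the same incoming line and just wait on the LFB that's already allocated for it.

```cpp
#include <cstddef>
#include <cstdint>

// One load per 64-byte cache line: one L1d miss / LFB allocation per line.
uint64_t sum_one_per_line(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i];
    return sum;
}

// Four loads per line: still only one miss / LFB per line; the extra loads
// just point at the LFB already waiting for that line.
uint64_t sum_four_per_line(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i] + buf[i + 16] + buf[i + 32] + buf[i + 48];
    return sum;
}
```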
And any single load that doesn't cross a cache-line boundary has the same cost as any other, whether it's 1 byte or 32 bytes, or even 64 bytes with AVX-512 (there's a small sketch after this list). The few exceptions to this I can think of are:
- unaligned 16-byte loads before Nehalem: `movdqu` decodes to extra uops, even if the address is aligned.
- SnB/IvB 32-byte AVX loads take 2 cycles in the same load port, one for each 16-byte half.
- AMD may have some penalties for crossing 16-byte or 32-byte boundaries with unaligned loads.
- AMD CPUs before Zen2 split 256-bit (32-byte) AVX/AVX2 operations into two 128-bit halves, so this rule of same cost for any size only applies up to 16 bytes on AMD. Or up to 8 bytes on some very old CPUs that split 128-bit vectors into 2 halves, like Pentium-M or Bobcat.
- integer loads may have 1 or 2 cycles lower load-use latency than SIMD vector loads. But you're talking about the incremental cost of doing more loads, so there's no new address to wait for. (Just a different immediate displacement from the same base register, presumably, or something else cheap to calculate.)
I'm ignoring the effects of reduced turbo clocks from using 512-bit instructions, or even 256-bit on some CPUs.
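To make the same-cost point concrete, here's a minimal sketch with intrinsics (function names are mine): on CPUs with load ports at least as wide as the access (Haswell and later for the 32-byte case), each of these is a single load instruction and a single load uop, as long as the access doesn't cross a cache-line boundary.

```cpp
#include <immintrin.h>
#include <cstdint>

// Each of these is one load uop; the width doesn't change the per-load cost
// as long as the access stays within one cache line.
uint8_t load_1byte(const uint8_t *p) { return *p; }        // e.g. movzx
__m128i load_16bytes(const void *p) {
    return _mm_loadu_si128((const __m128i *)p);            // movdqu
}
__m256i load_32bytes(const void *p) {
    return _mm256_loadu_si256((const __m256i *)p);         // vmovdqu
}
```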
Once you pay the cost of a cache miss, the rest of the line is hot in L1d cache until some other thread wants to write it and their RFO (read-for-ownership) invalidates your line.
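To illustrate that invalidation cost (struct names are mine, assuming a 64-byte line size): if another thread's write target shares your line, its RFOs repeatedly steal the line back; padding each piece of hot data to its own line keeps your copy hot in L1d.

```cpp
#include <atomic>
#include <cstdint>

// Both counters share one 64-byte line: another core writing `b` keeps
// RFO'ing the line away from the core that's using `a` (false sharing).
struct SharedLine {
    std::atomic<uint64_t> a;
    std::atomic<uint64_t> b;
};

// One counter per line: each core's line stays hot in its own L1d.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a;
    alignas(64) std::atomic<uint64_t> b;
};
```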
Calling a non-inline function 64 times instead of once is obviously more expensive, but I think that's just a bad example of what you're trying to ask. Maybe a better example would have been two `int` loads vs. two `__m128i` loads?
Cache misses aren't the only thing that costs time, although they can easily dominate. But still, a call+ret takes at least 4 clock cycles (https://agner.org/optimize/ instruction tables for Haswell show call and ret each have one per 2 clocks throughput, and I think this is about right), so looping and calling a function 64 times on the 64 bytes of a cache line takes at least 256 clock cycles. That's probably longer than the inter-core latency on some CPUs. If it could inline and auto-vectorize with SIMD, the incremental cost beyond the cache miss would be vastly smaller, depending on what it does.
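For a rough sense of the gap (`process_byte` and the AVX2 sum below are hypothetical stand-ins for whatever the real per-byte work is): the call-per-byte version pays call/ret overhead 64 times, while an inlined SIMD version touches the whole line with a handful of instructions.

```cpp
#include <immintrin.h>
#include <cstdint>

extern int process_byte(uint8_t b);   // hypothetical non-inline function

// 64 call+ret pairs: at least ~256 cycles of overhead on top of the real work.
int process_line_calls(const uint8_t *line) {
    int total = 0;
    for (int i = 0; i < 64; i++)
        total += process_byte(line[i]);
    return total;
}

// If the work inlines and vectorizes (here: summing the 64 bytes with AVX2),
// the incremental cost beyond the cache miss is just a few instructions.
int sum_line_avx2(const uint8_t *line) {
    __m256i v0 = _mm256_loadu_si256((const __m256i *)line);
    __m256i v1 = _mm256_loadu_si256((const __m256i *)(line + 32));
    __m256i zero = _mm256_setzero_si256();
    // vpsadbw sums each group of 8 bytes into a 64-bit lane.
    __m256i s = _mm256_add_epi64(_mm256_sad_epu8(v0, zero),
                                 _mm256_sad_epu8(v1, zero));
    __m128i t = _mm_add_epi64(_mm256_castsi256_si128(s),
                              _mm256_extracti128_si256(s, 1));
    t = _mm_add_epi64(t, _mm_unpackhi_epi64(t, t));
    return (int)_mm_cvtsi128_si64(t);
}
```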
A load that hits in L1d is extremely cheap, like 2 per clock throughput. A load as a memory operand for an ALU instruction (instead of needing a separate `mov`) can decode as part of the same uop as the ALU instruction, so it doesn't even cost extra front-end bandwidth.
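As a small illustration (function name is mine), a compiler will typically fold the load into the ALU instruction here, something like `add eax, [rsi]`, instead of a separate `mov` plus `add`:

```cpp
// The load becomes a memory operand of the add, so it doesn't cost a
// separate instruction, and it can micro-fuse with the ALU uop on Intel.
int add_from_mem(int acc, const int *p) {
    return acc + *p;   // typically compiles to something like: add eax, [rsi]
}
```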
Using an easier-to-decode format that always fills a cache line is probably a win for your use-case, unless it means looping more times. When I say easier to decode, I mean fewer steps in the calculation, not simpler-looking source code (like a simple loop that runs 64 iterations).