If you need to read any bytes from a cache line, the core has to grab the whole cache line in (at least) MESI Shared state. On Haswell and later, the data path between L2 and L1d cache is 64 bytes wide (https://www.realworldtech.com/haswell-cpu/5/), so literally the whole line arrives at the same time, in the same clock cycle. There's no benefit to only reading the low 2 bytes vs. the high and low byte, or byte 0 and byte 32.
The same thing is essentially true on earlier CPUs; lines are still sent as a whole, and arrive in a burst of maybe 2 to 8 clock cycles. (AMD multi-socket K10 could even create tearing across 8-byte boundaries when sending a line between cores on different sockets over HyperTransport, presumably because it allowed cache reads and/or writes to happen between the cycles of sending or receiving a line.)
(Letting a load start as soon as the byte it needs arrives is called "early restart" in CPU-architecture terminology. An associated trick is "critical word first", where reads from DRAM start the burst with the word needed by the demand load that triggered it. Neither of these is a big factor in modern x86 CPUs with data paths as wide as a cache line, or close to it, where a line might arrive in 2 cycles. It's probably not worth sending a word-within-line as part of requests for cache lines, even in cases where the request didn't just come from HW prefetch.)
Multiple cache-miss loads to the same line don't take up extra memory-parallelism resources. Even in-order CPUs usually don't stall until something tries to use a load result that isn't ready. Out-of-order execution can definitely keep going and get other work done while waiting for an incoming cache line. On Intel CPUs, for example, an L1d miss for a line allocates a Line-Fill buffer (LFB) to wait for the incoming line from L2. But further loads to the same cache line that execute before the line arrives just point their load-buffer entry to the LFB that's already allocated to wait for that line, so it doesn't reduce your ability to have multiple outstanding cache misses (miss under miss) to other lines later.
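As a rough sketch of that point (function names are mine, assuming a cold buffer and 64-byte lines): both loops below take one demand miss per line; the extra loads in the second loop target the same incoming line and just wait on the LFB that's already allocated for it.

```cpp
#include <cstddef>
#include <cstdint>

// One load per 64-byte cache line: one L1d miss / LFB allocation per line.
uint64_t sum_one_per_line(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i];
    return sum;
}

// Four loads per line: still only one miss / LFB per line; the extra loads
// just point at the LFB already waiting for that line.
uint64_t sum_four_per_line(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i] + buf[i + 16] + buf[i + 32] + buf[i + 48];
    return sum;
}
```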
And any single load that doesn't cross a cache-line boundary has the same cost as any other, whether it's 1 byte or 32 bytes, or even 64 bytes with AVX-512 (there's a small sketch after this list). The few exceptions to this I can think of are:
- unaligned 16-byte loads before Nehalem: `movdqu` decodes to extra uops, even if the address is aligned.
- SnB/IvB 32-byte AVX loads take 2 cycles in the same load port, one for each 16-byte half.
- AMD may have some penalties for crossing 16-byte or 32-byte boundaries with unaligned loads.
- AMD CPUs before Zen2 split 256-bit (32-byte) AVX/AVX2 operations into two 128-bit halves, so this rule of same cost for any size only applies up to 16 bytes on AMD. Or up to 8 bytes on some very old CPUs that split 128-bit vectors into 2 halves, like Pentium-M or Bobcat.
- integer loads may have 1 or 2 cycles lower load-use latency than SIMD vector loads. But you're talking about the incremental cost of doing more loads, so there's no new address to wait for. (Just a different immediate displacement from the same base register, presumably, or something else cheap to calculate.)
I'm ignoring the effects of reduced turbo clocks from using 512-bit instructions, or even 256-bit on some CPUs.
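To make the same-cost point concrete, here's a minimal sketch with intrinsics (function names are mine): on CPUs with load ports at least as wide as the access (Haswell and later for the 32-byte case), each of these is a single load instruction and a single load uop, as long as the access doesn't cross a cache-line boundary.

```cpp
#include <immintrin.h>
#include <cstdint>

// Each of these is one load uop; the width doesn't change the per-load cost
// as long as the access stays within one cache line.
uint8_t load_1byte(const uint8_t *p) { return *p; }        // e.g. movzx
__m128i load_16bytes(const void *p) {
    return _mm_loadu_si128((const __m128i *)p);            // movdqu
}
__m256i load_32bytes(const void *p) {
    return _mm256_loadu_si256((const __m256i *)p);         // vmovdqu
}
```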
Once you pay the cost of a cache miss, the rest of the line is hot in L1d cache until some other thread wants to write it and their RFO (read-for-ownership) invalidates your line.
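To illustrate that invalidation cost (struct names are mine, assuming a 64-byte line size): if another thread's write target shares your line, its RFOs repeatedly steal the line back; padding each piece of hot data to its own line keeps your copy hot in L1d.

```cpp
#include <atomic>
#include <cstdint>

// Both counters share one 64-byte line: another core writing `b` keeps
// RFO'ing the line away from the core that's using `a` (false sharing).
struct SharedLine {
    std::atomic<uint64_t> a;
    std::atomic<uint64_t> b;
};

// One counter per line: each core's line stays hot in its own L1d.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a;
    alignas(64) std::atomic<uint64_t> b;
};
```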
Calling a non-inline function 64 times instead of once is obviously more expensive, but I think that's just a bad example of what you're trying to ask. Maybe a better example would have been two `int` loads vs. two `__m128i` loads?
Cache misses aren't the only thing that costs time, although they can easily dominate. But still, a call+ret takes at least 4 clock cycles (https://agner.org/optimize/ instruction tables for Haswell show call and ret each have one per 2 clocks throughput, and I think this is about right), so looping and calling a function 64 times on the 64 bytes of a cache line takes at least 256 clock cycles. That's probably longer than the inter-core latency on some CPUs. If it could inline and auto-vectorize with SIMD, the incremental cost beyond the cache miss would be vastly smaller, depending on what it does.
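For a rough sense of the gap (`process_byte` and the AVX2 sum below are hypothetical stand-ins for whatever the real per-byte work is): the call-per-byte version pays call/ret overhead 64 times, while an inlined SIMD version touches the whole line with a handful of instructions.

```cpp
#include <immintrin.h>
#include <cstdint>

extern int process_byte(uint8_t b);   // hypothetical non-inline function

// 64 call+ret pairs: at least ~256 cycles of overhead on top of the real work.
int process_line_calls(const uint8_t *line) {
    int total = 0;
    for (int i = 0; i < 64; i++)
        total += process_byte(line[i]);
    return total;
}

// If the work inlines and vectorizes (here: summing the 64 bytes with AVX2),
// the incremental cost beyond the cache miss is just a few instructions.
int sum_line_avx2(const uint8_t *line) {
    __m256i v0 = _mm256_loadu_si256((const __m256i *)line);
    __m256i v1 = _mm256_loadu_si256((const __m256i *)(line + 32));
    __m256i zero = _mm256_setzero_si256();
    // vpsadbw sums each group of 8 bytes into a 64-bit lane.
    __m256i s = _mm256_add_epi64(_mm256_sad_epu8(v0, zero),
                                 _mm256_sad_epu8(v1, zero));
    __m128i t = _mm_add_epi64(_mm256_castsi256_si128(s),
                              _mm256_extracti128_si256(s, 1));
    t = _mm_add_epi64(t, _mm_unpackhi_epi64(t, t));
    return (int)_mm_cvtsi128_si64(t);
}
```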
A load that hits in L1d is extremely cheap, like 2 per clock throughput. A load as a memory operand for an ALU instruction (instead of needing a separate `mov`) can decode as part of the same uop as the ALU instruction, so it doesn't even cost extra front-end bandwidth.
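As a small illustration (function name is mine), a compiler will typically fold the load into the ALU instruction here, something like `add eax, [rsi]`, instead of a separate `mov` plus `add`:

```cpp
// The load becomes a memory operand of the add, so it doesn't cost a
// separate instruction, and it can micro-fuse with the ALU uop on Intel.
int add_from_mem(int acc, const int *p) {
    return acc + *p;   // typically compiles to something like: add eax, [rsi]
}
```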
Using an easier-to-decode format that always fills a cache line is probably a win for your use-case, unless it means looping more times. When I say easier to decode, I mean fewer steps in the calculation, not simpler-looking source code (like a simple loop that runs 64 iterations).