I have some code which reads from an array. The array is largish. I'd expect it to live substantially in L2 cache. Call this TOld.

I wrote an alternative that reads from an array that fits mainly in a single cache line (that I don't expect to be evicted). Call this TNew.

They should produce the same results, and they do. TOld does a single read of its array to get its result. TNew does 6 reads (and a few simple arithmetic ops which are negligible). In both cases I'd expect the reads to dominate.
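Schematically, the two look something like this (a made-up sketch just to show the access patterns; the real code, names, and sizes differ):

```c
#include <stdint.h>

/* TOld: one load from a largish table, expected to sit in L2. */
static uint32_t big_table[1 << 16];      /* 256 KiB, bigger than L1d */

uint32_t t_old(uint32_t x)
{
    return big_table[x & 0xFFFF];        /* single (likely L2) load */
}

/* TNew: six loads from a tiny table that fits in one 64-byte line. */
static uint32_t small_table[16];         /* 16 * 4 B = 64 B */

uint32_t t_new(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 6; i++)          /* six L1d loads + trivial ALU */
        r += small_table[(x >> (4 * i)) & 0xF];
    return r;
}
```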

An L2 read (TOld) costs ~15 cycles. An L1 read (TNew) costs ~5 cycles, but I do 6 of them, so I'd expect ~30 cycles total. On that arithmetic TNew should run at about half the speed of TOld. Instead the difference is just a few percent.

This suggests that the L1 cache is capable of doing 2 reads simultaneously, and from the same cache line. Is that possible in x86/x64?

The other alternative is that I haven't correctly aligned TNew's array to land in a single cache line, so it straddles 2 cache lines; maybe that allows 2 simultaneous reads, one per line. Is that possible?
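If the straddling theory were right, it should be easy to rule out by forcing the alignment, e.g. (C11 sketch, assuming a 64-byte line):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Pin TNew's table to a 64-byte boundary so it can't straddle two
   cache lines (assumes a 64-byte line size). */
static alignas(64) uint32_t small_table[16];    /* exactly 64 bytes */

static_assert(sizeof(small_table) == 64, "table should fill one line");

/* Or check at run time: assert(((uintptr_t)small_table & 63) == 0); */
```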

Frankly neither seems credible, but opinions welcome.

  • x86-64 is an ISA, not a microarchitecture. Specific microarchitectures have different performance. But modern mainstream Intel and AMD can both do 2 loads per clock (from the same or different cache lines). Intel since SnB, AMD since K8. [How can cache be that fast?](https://electronics.stackexchange.com/a/329955). Intel SnB/IvB had the possibility of bank-conflicts when reading different cache lines, but HSW and later removed even that. – Peter Cordes Oct 09 '20 at 06:33
  • Also note that execution is pipelined. ~5 cycles is the L1d load-use latency, but you only add up the times if one load is dependent on the previous (e.g. following the `next` pointers in a linked list). Otherwise pipelined execution can overlap accesses, even on cache miss. (That's why modern x86 CPUs have so many load buffers) – Peter Cordes Oct 09 '20 at 06:38
  • @PeterCordes - I guessed it would be either you or BeeOnRope! Thanks for a precise answer. You're right about uArch vs ISA. Chip's a Haswell, and I'll follow your 'how can...' link. Have to say I'm truly amazed it's possible to support this. – user3779002 Oct 09 '20 at 07:17
  • L1d cache is multi-ported to support multiple reads (and a write) in the same cycle, very much like how the register file has to be multi-ported to support reading like 8 values and writing 4 in a single cycle. Accessing the *same* line multiple times is no harder than accessing different lines, especially when both accesses are reads. See also the linked duplicates; I found some that cover what I said in comments. – Peter Cordes Oct 09 '20 at 07:20
  • Given your answer, it must be dual-ported. But registers are accessed by integers (exc. shadow regs). Cache is accessed by content, so content-addressed, so expensive/fast/hot associative lookup circuitry. Didn't expect that could support parallel accesses *as well*. Reading your link now. – user3779002 Oct 09 '20 at 07:30
  • The actual cache SRAM array is multi-ported; the tag-comparators are replicated for each port to determine hit/miss independently and mux the right data to the output, or indicate miss. (https://courses.cs.washington.edu/courses/cse378/07au/lectures/L16-More-Cache-Organizations.pdf shows how the number of comparators only scales with the number of ways per set, not the cache size. For a multi-ported cache, also scaling with # of ports). Yes, the register file is indexed by integers known to be in range (from the RAT = register allocation table for reg renaming), and doesn't need comparators. – Peter Cordes Oct 09 '20 at 07:47
  • However, actual content-addressable memory (CAM) is needed for the TLBs, which are also multi-ported ([VIPT Cache: Connection between TLB & Cache?](https://stackoverflow.com/q/46480015)). TLB entries don't have a lot of data per entry, compared to the size of a cache line. Modern TLBs are large enough that they're often not fully associative, so maybe not a true CAM. Practical CPU caches definitely aren't fully associative, just 8-way, so basically you index a set and fetch (in parallel) all the tags and data from that set and feed them to a set of comparators / muxes. – Peter Cordes Oct 09 '20 at 07:52
    "...and fetch (in parallel) all the tags and data from that set and feed them to a set of comparators / muxes" There's the key misunderstanding. I thought each cache line had its own complete set of comparators. All cleared up, thanks again. – user3779002 Oct 09 '20 at 08:25
  • Ok cool, I wondered if that was the disconnect. Adding [VIPT Cache: Connection between TLB & Cache?](https://stackoverflow.com/q/46480015) where my answer goes through the details of a cache hit in enough detail to mention the comparators receiving data from the array and a tag to compare against from the TLB. – Peter Cordes Oct 09 '20 at 08:31
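
To make the latency-vs-throughput point from the comments concrete, a minimal sketch (my own illustration, not code from the comments): dependent loads serialize on the ~5-cycle load-use latency, while independent loads can pipeline, with two starting per clock on Haswell.

```c
#include <stddef.h>
#include <stdint.h>

/* Dependent loads: each address comes from the previous load, so the
   ~5-cycle L1d load-use latency is paid serially, n times. */
uint32_t chase(const uint32_t *next, uint32_t i, size_t n)
{
    while (n--)
        i = next[i];                  /* load result feeds next address */
    return i;
}

/* Independent loads: all addresses are known up front, so the loads
   pipeline, and Haswell's L1d can start two of them per clock. */
uint32_t sum6(const uint32_t *t)
{
    return t[0] + t[1] + t[2] + t[3] + t[4] + t[5];
}
```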

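And for anyone who shared my misconception (a full comparator set per cache line): a toy model of the set-associative lookup described in the comments, with Haswell-like numbers (32 KiB, 8-way, 64-byte lines) and the VIPT/TLB translation step ignored. Only the 8 tags of one indexed set get compared, so the comparator count scales with the number of ways, not the cache size.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64
#define WAYS        8
#define SETS       64                 /* 32 KiB / (64 B * 8 ways) */

struct way   { bool valid; uint64_t tag; uint8_t data[LINE_BYTES]; };
struct cache { struct way sets[SETS][WAYS]; };

/* Toy lookup: index one set with the address bits just above the line
   offset, then compare that set's 8 tags "in parallel" (a loop here,
   8 hardware comparators in the real thing). */
const uint8_t *lookup(const struct cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    uint64_t set  = line % SETS;       /* index bits pick the set   */
    uint64_t tag  = line / SETS;       /* remaining high bits = tag */
    for (int w = 0; w < WAYS; w++)
        if (c->sets[set][w].valid && c->sets[set][w].tag == tag)
            return c->sets[set][w].data;   /* hit: mux this way out */
    return 0;                              /* miss */
}
```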