I have some code which reads from an array. The array is largish. I'd expect it to live substantially in L2 cache. Call this TOld.
I wrote an alternative that reads from an array that fits mainly in a single cache line (that I don't expect to be evicted). Call this TNew.
They should produce the same results, and they do. TOld does a single read of its array to get its result. TNew does 6 reads (and a few simple arithmetic ops which are negligible). In both cases I'd expect the reads to dominate.
Cost of L2 cache read by TOld ~15 cycles. Cost of L1 cache reads by TNew ~5 cycles, but I do 6 of them, so expect total ~30 cycles. So I'd expected TNew should be about half the speed of TOld. Instead it's just a few percent difference.
This suggests that the L1 cache is capable of doing 2 reads simultaneously, and from the same cache line. Is that possible in x86/x64?
Other alternative is I haven't correctly aligned TNew's array to land in a single cache line and it's in straddling 2 cache lines, maybe that allows 2 simultaneous reads, one per line. Is that possible?
Frankly neither seem credible, but opinions welcome.