
What is the cache access speed of modern CPUs? How many bytes can be read from or written to memory per processor clock tick on an Intel P4, Core 2, Core i7, or AMD CPU?

Please answer with both theoretical numbers (width of the load/store unit, with its throughput in uops/tick) and practical numbers (memcpy speed tests or STREAM benchmark results), if any.

PS: This question relates to the maximal rate of load/store instructions in assembler. There is a theoretical load rate (every instruction issued per tick being the widest load), but a processor can sustain only part of that, which is the practical limit I am after.

Ian Ringrose
osgx
  • @osgx: Move to serverfault etc, not a programming question is it? – TFD Mar 01 '10 at 01:17
    @TFD, no, this is *very* programming related. – Nikolai Fetissov Mar 01 '10 at 01:39
  • Consult "Analyzing Cache Bandwidth on the Intel Core 2 Architecture" by Robert Schöne, Wolfgang E. Nagel, and Stefan Pflüger, Center for Information Services and High Performance Computing, Technische Universität Dresden, 01062 Dresden, Germany. In this paper, measured bandwidths between the computing cores and the different caches are presented. The STREAM benchmark is one of the kernels most used by scientists to determine memory bandwidth. For deeper insight the STREAM benchmark was redesigned to get exact values for small problem sizes as well. – osgx May 22 '10 at 23:59
  • So do you want to know the answer in "maximal rate of load/store instructions" or "bytes loaded/cycle"? The answer is quite different. Recent CPUs are limited more by instructions (e.g., 2 loads/cycle) than bytes (so a byte load and a 32-byte load have about the same cost), at least in cache levels close to the core. For DRAM it is more about cache lines/cycle: i.e., it doesn't matter if you load an entire line or 1 byte from it, it costs the same. – BeeOnRope Jan 01 '18 at 05:37

2 Answers


For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes per clock read AND 128 bits = 16 bytes per clock written (can a read and a write be combined in a single cycle?)

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 read and write ports be used in the same clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 GB/s.

Latency (in clock ticks), some measured with CPU-Z's latency tool or with lmbench's lat_mem_rd; both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:

           L1     L2     L3 (cycles)  memory          link
Core 2      3     15     --           66 ns           http://www.anandtech.com/show/2542/5
Core i7-xxx 4     11     39          40c+67ns         http://www.anandtech.com/show/2542/5
Itanium     1     5-6    12-17       130-1000 (cycles)
Itanium2    2     6-10   20          35c+160ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8            12                 40-70c +64ns     http://www.anandtech.com/show/2139/3
Intel P4    2     19     43          200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3     20                 180 (cycles)     --//--
AthlonFX-51 3     13                 125 (cycles)     --//--
POWER4      4     12-20  ??          hundreds cycles  --//--
Haswell     4     11-12  36          36c+57ns         http://www.realworldtech.com/haswell-cpu/5/    

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program can be found in its man page or here on SO.

osgx
    Answering your own question ? You still haven't explained what is is that you are trying to achieve with this information. You may get a better answer if you do. – Paul R Mar 01 '10 at 11:09
  • Does a 256-bit port for the L2 cache mean that, on an L1 cache miss and L2 cache hit, and supposing a 64-byte cache block, reading the L2 block to write it into the L1 cache will take 2 cycles? – isma Jun 20 '20 at 20:47

The widest reads/writes are 128-bit (16-byte) SSE loads/stores. The L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.

I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?

Paul R
  • Thanks. How many SSE loads can be issued per clock? I want to find the peak load/store bandwidth for several generations of x86. Not only for memcpy, but also for a plain read and a plain write (closer to the STREAM benchmark) – osgx Mar 01 '10 at 09:51
    @osgx - it depends on the CPU - Core 2 and Core i7 can both *issue* 2 SSE loads per clock – Paul R Mar 01 '10 at 11:07
  • About the fastest memcpy - yes, the question could be re-asked as "What is the theoretically fastest memcpy" (without an actual implementation), and not only for very big data (as usual) but for small data too (up to L1/2 size, up to L2/2 size, L3/3 size). – osgx Mar 01 '10 at 14:39