6

When learning assembly I realized that I should put frequently accessed data in registers instead of memory, because memory is much slower.

The question is, how can the CPU run faster than memory when the instructions are fetched from memory in the first place? Does the CPU usually spend a lot of time waiting for instructions from memory?

EDIT: To run a program, we compile it to a file containing machine code. Then we load that file into memory and run one instruction after another. The CPU needs to know which instruction to run next, and that piece of information is fetched from memory. I'm not asking about manipulating data but about the process of reading the instructions from memory. Sorry if I wasn't clear enough.

EDIT 2:

Example: xor eax, eax compiles to 31c0 on my computer. I know this instruction itself is fast. But to clear eax, the CPU needs to read 31c0 from memory first. That read should take a long time if accessing memory is slow, so does the CPU just stall for that period?

Qian
  • If this is off-topic, please let me know where I should ask instead. – Qian Aug 22 '17 at 13:47
  • You can add example code and benchmarks in order to make this more on topic – raam86 Aug 22 '17 at 13:49
  • @raam86 I really don't know how I can benchmark this. – Qian Aug 22 '17 at 14:10
  • @TheoYou Try the code in my answer. Remember to turn off optimizations. – klutt Aug 22 '17 at 14:21
  • @klutt Please take a careful look at my example. I am not asking about manipulating data in memory. I'm not a native English speaker; I'm sorry if I wasn't clear. – Qian Aug 22 '17 at 14:34
  • Ah, I misunderstood. In general, there's no good way to benchmark it. You have to look at the overall performance and use your knowledge about caches. – klutt Aug 22 '17 at 14:37

3 Answers

6

Code fetch in parallel with instruction execution is so critical that even 8086 did it (to a limited extent, with a very small prefetch buffer and low bandwidth). Even so, code fetch bandwidth actually was THE major bottleneck for 8086.

(I just realized you didn't tag this x86, although you did use an x86 instruction as an example. All my examples are x86, but the basics are pretty much the same for any other architecture, except that non-x86 CPUs won't use a decoded-uop cache; x86 is the only ISA still in common use that's so hard to decode that it's worth caching the decode results.)


In modern CPUs, code-fetch is rarely a bottleneck because caches and prefetching hide the latency, and bandwidth requirements are usually low compared to the bandwidth required for data. (Bloated code with a very large code footprint can run into slowdowns from instruction-cache misses, though, leading to stalls in the front-end.)

L1I cache is separate from L1D cache, and CPUs fetch/decode a block of at least 16 bytes of x86 code per cycle. CPUs with a decoded-uop cache (Intel Sandybridge family, and AMD Ryzen) even cache already-decoded instructions to remove decode bottlenecks.

See http://www.realworldtech.com/sandy-bridge/3/ for a fairly detailed write-up of the front-end in Intel Sandybridge (fetch/pre-decode/decode/rename+issue), with block diagrams like this, showing Intel Sandybridge vs. Intel Nehalem and AMD Bulldozer's instruction-fetch logic. (Decode is on the next page). The "pre-decode" stage finds instruction boundaries (i.e. decodes instruction-length ahead of decoding what each instruction actually is).

(Front-end block diagram from David Kanter's SnB write-up, linked above.)

L1I cache misses result in a request to the unified L2. Modern x86 CPUs also have a shared L3 cache (shared between multiple cores).

Hardware prefetching brings soon-to-be-needed code into L2 and L1I, just like data prefetching into L2 and L1D. This hides the > 200 cycle latency to DRAM most of the time, usually only failing on jumps to "cold" functions. It can almost always stay ahead of decode/execute when running a long sequence of code with no taken branches, unless something else (like data loads/stores) is using up all the memory bandwidth.

You could construct some code that decodes at 16 bytes per cycle (at 4 GHz that is 64 GB/s, more than a single core can typically stream from main memory), or maybe even faster on an AMD CPU. But usually decode bottlenecks will limit you more than pure code-fetch bandwidth.
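
If you want to poke at code fetch yourself, here is a rough sketch, under assumptions I should spell out: x86-64 Linux, GCC, SSE2 intrinsics for clflush, and a kernel that still allows an anonymous read-write-execute mapping (hardened setups may refuse it). It fills a few MiB with one-byte NOPs plus a final RET, flushes those lines out of the whole cache hierarchy, and times a cold call vs. a warm call into that buffer. Don't expect the cold run to be hundreds of times slower; sequential code prefetch hides most of the DRAM latency, which is exactly the point above, and the exact numbers vary a lot by microarchitecture.

/* Rough experiment: time executing a big block of straight-line code
 * cold (flushed from cache) vs. warm.  x86-64 Linux + GCC assumed. */
#include <emmintrin.h>      /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define CODE_BYTES (4u << 20)               /* 4 MiB of machine code */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    unsigned char *code = mmap(NULL, CODE_BYTES,
                               PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) { perror("mmap"); return 1; }

    memset(code, 0x90, CODE_BYTES - 1);     /* 0x90 = one-byte NOP */
    code[CODE_BYTES - 1] = 0xC3;            /* 0xC3 = RET */
    void (*run)(void) = (void (*)(void))code;   /* GCC extension: data ptr -> function ptr */

    for (size_t i = 0; i < CODE_BYTES; i += 64) /* evict the code from every cache level */
        _mm_clflush(code + i);
    _mm_mfence();

    double t0 = now_sec();
    run();                                  /* cold: code fetched from DRAM (with prefetch) */
    double t1 = now_sec();
    run();                                  /* warm: code mostly served from cache */
    double t2 = now_sec();

    printf("cold: %.3f ms   warm: %.3f ms\n", (t1 - t0) * 1e3, (t2 - t1) * 1e3);
    return 0;
}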


See also Agner Fog's microarch guide for more about the front-end in various microarchitectures, and optimizing asm for them.

See also other CPU performance links in the tag wiki.

Peter Cordes
4

If you have frequently accessed data, chances are that you also have the same instructions repeatedly processing it. An efficient CPU will not fetch the same instructions again and again from slow memory. Instead, they are kept in an instruction cache, which has a very short access time. Therefore, the CPU generally doesn't need to wait for instructions.
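
A minimal sketch of why this works out in practice (build it without heavy optimization, e.g. -O1, or the compiler may fold this particular loop into a closed-form expression): the loop below executes its handful of instructions a hundred million times, but those instructions are only fetched from main memory once; every later iteration gets them from the instruction cache (or, on recent x86 CPUs, from the decoded-uop cache).

#include <stdio.h>

int main(void)
{
    unsigned long long sum = 0;
    /* The loop body is only a few bytes of machine code.  After the first
     * iteration it sits in the instruction cache, so code fetch never has
     * to touch RAM again for the remaining 99,999,999 iterations. */
    for (unsigned long long i = 0; i < 100000000ULL; i++)
        sum += i;
    printf("%llu\n", sum);   /* print the result so the loop isn't removed entirely */
    return 0;
}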

Sydney Hauke
  • What if there are no jumps in my program, so that every instruction is executed only once? – Qian Aug 22 '17 at 14:28
  • This is a special case that is very unlikely to happen. Still, instructions are fetched in blocks rather than one by one, which minimizes the mean access time for instruction fetching. – Sydney Hauke Aug 22 '17 at 14:37
  • @TheoYou: Hardware prefetching into the instruction cache can usually stay ahead of execution, hiding the latency of code fetch. See my answer for some more details. – Peter Cordes Aug 22 '17 at 23:23
3

The memory is very slow compared to the CPU. Fetching data from RAM costs roughly 200 clock cycles, so in general it is very important for performance to write cache-friendly code. And yes, the CPU spends a lot of time waiting for data.

Why is this the case? Well, it's simply different kinds of memory. In general it is more expensive to build fast memory, so in order to keep costs down, the fastest memory is reserved for the registers. Physical distance limits speed too: memory you want to access fast needs to be close to the core. Light travels at around 300 000 km/s, which is roughly 30 cm per nanosecond, and electrical signals are slower still. Modern CPUs run at several GHz, so a clock cycle lasts only a fraction of a nanosecond, and a signal has to make a round trip. RAM that sits roughly 10 cm away therefore cannot possibly respond within a single clock cycle, and the DRAM chips themselves add far more delay than the wires do. We have already reached the barrier where it is impossible (not hard, impossible) to make main memory keep up with the CPU.

However, this physical limitation (the speed of light) only affects access latency, not bandwidth. So once you fetch data at address addr, it costs very little extra to also fetch addr+1.
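
A rough sketch of how you could see that distinction yourself: summing a big array sequentially is bandwidth-bound, because the prefetcher can have addr+1, addr+2, ... on the way before you ask for them, while chasing a chain of pointers through the same array is latency-bound, because each load's address is only known once the previous load has finished. The shuffle uses Sattolo's algorithm so the chain is one big cycle that visits every element; the array size, timing code and output format are just my choices for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)                    /* 16M elements, ~128 MB, far bigger than any cache */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t *next = malloc(N * sizeof(*next));
    if (!next) return 1;

    /* Sattolo's algorithm: a random permutation consisting of a single cycle,
     * so following next[] visits every element once in a cache-hostile order. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    double t0 = now_sec();
    size_t sum = 0;                     /* bandwidth-bound: sequential, prefetch-friendly */
    for (size_t i = 0; i < N; i++) sum += next[i];
    double t1 = now_sec();

    size_t p = 0;                       /* latency-bound: every load depends on the previous one */
    for (size_t i = 0; i < N; i++) p = next[p];
    double t2 = now_sec();

    printf("sequential: %.0f ms   pointer chase: %.0f ms   (%zu %zu)\n",
           (t1 - t0) * 1e3, (t2 - t1) * 1e3, sum, p);
    free(next);
    return 0;
}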

Between the registers and RAM you have the cache. In a modern computer there are typically three levels of cache. This works similarly to how data from a hard drive is cached in RAM: when you read a piece of data, it is likely that you will need the surrounding data soon, so the surrounding data (a whole cache line) is read at the same time and stored in the cache. When you ask for the next piece of data, it is likely to already be in the cache. Whenever you request something from memory, there are circuits that check whether that piece of memory already exists in the cache or not.

You cannot control the cache directly. What you can do is write cache-friendly code. This can be tricky in advanced cases, but in general the trick is to avoid jumping around large distances in memory and to access memory sequentially where you can.

Here is a simple example of cache-friendly code:

int *squareMatrix = malloc(SIZE * SIZE * sizeof(*squareMatrix));
int sum = 0;
for (int i = 0; i < SIZE; i++)
   for (int j = 0; j < SIZE; j++)
      sum += squareMatrix[i*SIZE + j];   /* row by row: walks memory sequentially */

And a non-cache-friendly version:

int *squareMatrix = malloc(SIZE * SIZE * sizeof(*squareMatrix));
int sum = 0;
for (int i = 0; i < SIZE; i++)
   for (int j = 0; j < SIZE; j++)
      sum += squareMatrix[j*SIZE + i];   /* column by column: jumps SIZE*sizeof(int) bytes per read */

The difference is [j*SIZE+i] vs [i*SIZE+j]. The first version reads the whole matrix sequentially, greatly increasing the chance that the next element will already be in the cache when you ask for it.

Here is the difference of the above code on my computer with SIZE=30000:

$ time ./fast

real    0m2.755s
user    0m2.516s
sys     0m0.236s

$ time ./slow

real    0m18.609s
user    0m18.268s
sys     0m0.340s

As you can see, this can affect performance significantly.
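
For reference, here is one way to turn those snippets into a complete program; it is a sketch rather than the exact files timed above, and the command-line flag for picking the traversal order is my own addition. Touching every element first makes sure the reads really come from RAM (with SIZE = 30000 the matrix needs about 3.5 GB), and printing the sum keeps the compiler from deleting the loop even with optimizations enabled.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 30000

int main(int argc, char **argv)
{
    int columnwise = (argc > 1 && argv[1][0] == '1');   /* "1" = slow order, anything else = fast */
    size_t n = (size_t)SIZE * SIZE;
    int *m = malloc(n * sizeof(*m));
    if (!m) return 1;

    for (size_t k = 0; k < n; k++)      /* fill the matrix so the reads below really hit RAM */
        m[k] = 1;

    long long sum = 0;
    for (size_t i = 0; i < SIZE; i++)
        for (size_t j = 0; j < SIZE; j++)
            sum += columnwise ? m[j * SIZE + i] : m[i * SIZE + j];

    printf("%lld\n", sum);              /* prevents the compiler from optimizing the loop away */
    free(m);
    return 0;
}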

Typical access times and sizes for different types of memory. Very approximate, and just to give a general idea of it:

Memory type        |  # Clock ticks |    Size
===================|================|=============
register           |             1  |    8B each, around 128B total
level1 cache       |             5  |    32kB
level2 cache       |            10  |    1MB
level3 cache       |            50  |    20MB
RAM                |           200  |    16GB
SSD drive          |        10,000  |    500GB
Mechanical drive   |     1,000,000  |    4TB

It could also be mentioned that the level 1 cache is typically split into a data cache and an instruction cache.

klutt
  • *it does not cost anything extra to also fetch addr+1*: Not quite true. It takes an extra buffer to track an extra outstanding memory request. Memory parallelism is limited by the number of buffers. e.g. for data, the number of line-fill buffers is what limits single-core `memset` bandwidth, and why a big Xeon with higher-latency L3 / RAM has lower single-core bandwidth than a quad-core desktop CPU, even though with multiple cores active the Xeon has much higher aggregate bandwidth. See "latency bound platforms" in https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy – Peter Cordes Aug 22 '17 at 21:57
  • One problem with this answer is that it doesn't even mention that L1I cache is separate from L1D cache, and has a separate iTLB. Your example is a data access pattern, not code that jumps around. (Of course that would be complicated by branch prediction limitations, but you could do something that caused conflict misses in L1I before running into branch mispredicts, because it's only 8-way associative in Intel Sandybridge-family, for example.) – Peter Cordes Aug 22 '17 at 22:00