Memory is very slow compared to the CPU. Fetching data from RAM costs roughly 200 clock cycles, so in general it is very important for performance to write cache-friendly code. And yes, the CPU spends a lot of time waiting for data.
Why is this the case? Well, it is simply a different kind of memory. In general, faster memory is more expensive to build, so to keep costs down the fastest memory is reserved for the registers. Physical distance limits speed as well: memory you want to access quickly needs to be close to the core. Light travels at around 300,000 km/s, which is roughly 30cm per nanosecond, and electrical signals are slower than that. If the memory sits 15cm away, it is physically impossible to get the data back in under one nanosecond, because the signal has to travel there and back. RAM is typically around 10cm from the CPU, so the round trip alone costs most of a nanosecond. Modern CPUs work at a frequency of several GHz, meaning a clock cycle lasts only a fraction of a nanosecond, so we have already reached the barrier where it is impossible (not hard, impossible) to make the memory keep up with the CPU.
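To make the arithmetic concrete, here is a back-of-the-envelope calculation (just a sketch; the 10cm distance and the 3GHz clock are assumed round figures, not measurements):

#include <stdio.h>

int main(void)
{
    const double c = 3.0e8;        /* speed of light in m/s                 */
    const double distance = 0.10;  /* assumed CPU-to-RAM distance: 10cm     */
    const double clock_hz = 3.0e9; /* assumed CPU clock: 3GHz               */

    double round_trip_s = 2.0 * distance / c;  /* signal there and back     */
    printf("round trip: %.2f ns = %.1f clock cycles at 3 GHz\n",
           round_trip_s * 1e9, round_trip_s * clock_hz);
    return 0;
}

That works out to roughly 0.7ns, i.e. about two clock cycles spent purely on signal propagation, before the RAM chip has even started looking up the data.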
However, this physical limitation (the theory of relativity) only affects access time, not bandwidth. So when you fetch data at address addr, it does not cost anything extra to also fetch addr+1.
Between the registers and RAM you have the cache. In a modern computer there are typically three levels of cache. This works similarly to how data from a hard drive is cached in RAM: when you read a piece of data, it is likely that you will need the surrounding data soon, so the surrounding data is read at the same time and stored in the cache. When you ask for the next piece of data, it is then likely to already be in the cache. Whenever you request something from memory, there are circuits that check whether that piece of memory already exists in the cache or not.
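If you want to see the cache geometry of your own machine, on Linux with glibc you can query it through sysconf. A small sketch (note that the _SC_LEVEL* constants are glibc extensions, not portable, and may report 0 on some systems):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Cache line size: how many bytes are pulled in per memory access. */
    printf("L1 data cache line: %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1 data cache size: %ld B\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache size:      %ld B\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache size:      %ld B\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}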
You cannot control the cache directly. What you can do is write cache-friendly code. This can be tricky in advanced cases, but in general the trick is to not jump around large distances in memory. Try to access memory sequentially.
Here is a simple example of cache-friendly code:
int *squareMatrix=malloc(SIZE*SIZE*sizeof(*squareMatrix));
int sum=0;
for(int i=0; i<SIZE; i++)
    for(int j=0; j<SIZE; j++)
        sum+=squareMatrix[i*SIZE+j]; /* walks the array sequentially in memory */
And a non-cache-friendly version:
int *squareMatrix=malloc(SIZE*SIZE*sizeof(*squareMatrix));
int sum=0;
for(int i=0; i<SIZE; i++)
    for(int j=0; j<SIZE; j++)
        sum+=squareMatrix[j*SIZE+i]; /* jumps SIZE elements forward on every step */
The only difference is [i*SIZE+j] vs [j*SIZE+i]. The first version reads the whole matrix sequentially, greatly increasing the chance that the next element will already be in the cache when you ask for it.
Here is the difference for the above code on my computer, with SIZE=30000:
$ time ./fast
real 0m2.755s
user 0m2.516s
sys 0m0.236s
$ time ./slow
real 0m18.609s
user 0m18.268s
sys 0m0.340s
As you can see, this can affect performance significantly.
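For completeness, a full program along these lines might look as follows. This is a sketch, not the exact fast/slow programs timed above: the matrix initialisation and the command-line switch between the two traversal orders are added here so the example is self-contained and well defined.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 30000

int main(int argc, char **argv)
{
    /* Roughly 3.6GB, so this needs a machine with enough RAM. */
    int *squareMatrix = malloc((size_t)SIZE*SIZE*sizeof(*squareMatrix));
    if(!squareMatrix){
        perror("malloc");
        return EXIT_FAILURE;
    }

    /* Initialise so that the reads below are well defined. */
    for(size_t k = 0; k < (size_t)SIZE*SIZE; k++)
        squareMatrix[k] = 1;

    /* Run as "./a.out slow" to get the column-wise traversal. */
    int columnwise = argc > 1 && strcmp(argv[1], "slow") == 0;

    long long sum = 0;
    for(int i = 0; i < SIZE; i++)
        for(int j = 0; j < SIZE; j++)
            sum += columnwise ? squareMatrix[j*SIZE+i]   /* jumps around */
                              : squareMatrix[i*SIZE+j];  /* sequential   */

    printf("sum=%lld\n", sum);
    free(squareMatrix);
    return 0;
}

Compile with something like gcc -O2 and run it under time, with and without the slow argument, to reproduce the kind of difference shown above.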
Typical access times and sizes for the different types of memory (very approximate, just to give a general idea):
Memory type       | Clock ticks | Size
==================|=============|============================
register          |           1 | 8B each, around 128B total
level 1 cache     |           5 | 32kB
level 2 cache     |          10 | 1MB
level 3 cache     |          50 | 20MB
RAM               |         200 | 16GB
SSD drive         |      10,000 | 500GB
Mechanical drive  |   1,000,000 | 4TB
It could also be mentioned that the level 1 cache is typically split into a data cache and an instruction cache.