
I recently noticed that seemingly small changes in how a matrix is accessed in C can have a big impact on performance. For example, let's imagine we have these two fragments of C code. This one:

for (i = 0; i < 2048; i++) {
    for (j = 0; j < 2048; j++) {
        Matrix[i][j] = 9999;
    }
}

And this one:

for (j = 0; j < 2048; j++) {
    for (i = 0; i < 2048; i++) {
        Matrix[i][j] = 9999;
    }
}

The second version is about 2x slower than the first version. Why? I think it has to do with how memory is accessed: in each iteration of the inner loop, the first version accesses memory positions that are located next to each other, while the second version has to "jump" to a different region of memory on every access. Is this intuition right? Also, if I make the matrix small (64x64, for example), there is no difference in performance. Why? I would appreciate it if someone could provide an intuitive and rigorous explanation. I am using Ubuntu 14.04 LTS, by the way.
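In case it helps, a complete program to reproduce the measurements could look like this (a minimal sketch, assuming POSIX clock_gettime, which Ubuntu provides):

    /* Minimal, complete benchmark sketch; assumes a POSIX system
       (clock_gettime with CLOCK_MONOTONIC), as on Ubuntu 14.04.
       Build with: gcc -O2 bench.c -o bench */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    #define N 2048

    int Matrix[N][N];   /* external linkage, so the stores are not optimized away */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        int i, j;
        double t0;

        t0 = seconds();
        for (i = 0; i < N; i++)          /* version 1: row by row */
            for (j = 0; j < N; j++)
                Matrix[i][j] = 9999;
        printf("row-major:    %.4f s\n", seconds() - t0);

        t0 = seconds();
        for (j = 0; j < N; j++)          /* version 2: column by column */
            for (i = 0; i < N; i++)
                Matrix[i][j] = 9999;
        printf("column-major: %.4f s\n", seconds() - t0);

        return 0;
    }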

Peter

4 Answers

        for (i = 0; i < 2048; i++) {
            for (j = 0; j < 2048; j++) {
                Matrix[i][j] = 9999;
            }
        }

This form makes good use of the L1, L2 and L3 caches. When you loop with j over Matrix[i][j], the elements Matrix[i][0], Matrix[i][1], and so on lie at consecutive addresses (more precisely, at addresses differing by sizeof(Matrix[i][0])), so an access to Matrix[i][0] also brings the following elements, such as Matrix[i][1], into the cache: a whole cache line is fetched at once.

On the other hand,

        for (j = 0; j < 2048; j++) {
            for (i = 0; i < 2048; i++) {
                Matrix[i][j] = 9999;
            }
        }

the inner loop accesses the elements in the order Matrix[0][j], Matrix[1][j], and so on. The address of Matrix[1][j] is the address of Matrix[0][j] plus 2048*sizeof(Matrix[0][0]), since each row Matrix[i] holds 2048 entries.

So Matrix[0][j] lies in a different cache line than Matrix[1][j], and the fetch must go to RAM instead of being served from the cache.

In this second case, for a matrix this large, there is an access to RAM at (nearly) every iteration of the inner loop. This also answers the 64x64 case: a 64x64 matrix of int occupies only 16 KiB, which fits entirely in a typical 32 KiB L1 data cache, so after the first few accesses every line stays cached and the traversal order no longer matters.
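To make those strides concrete, here is a quick sketch (using the same hypothetical 2048x2048 int matrix as in the question) that prints the byte distance between consecutive inner-loop accesses in each version:

    #include <stdio.h>

    #define N 2048

    int Matrix[N][N];

    int main(void)
    {
        /* Version 1 steps one element at a time within a row... */
        printf("row-order step:    %ld bytes\n",
               (long)((char *)&Matrix[0][1] - (char *)&Matrix[0][0]));

        /* ...version 2 steps a whole row (N elements) at a time. */
        printf("column-order step: %ld bytes\n",
               (long)((char *)&Matrix[1][0] - (char *)&Matrix[0][0]));

        return 0;
    }

With 4-byte int this prints 4 and 8192: the row-order walk gets 16 consecutive writes out of every 64-byte cache line, while the column-order walk lands on a different line at every single write.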

alinsoar

"It's the cache! It's the cache!"

To visualise it, think about memory as a linear array...

By defining a 2D array:

uint8_t Matrix[4][4]

You are simply saying:

allocate 16 bytes, and access them as a 2D array, 4x4

This example assumes a 4-byte cache, to make things simple:

[Figure: the 4x4 2D array laid out in memory as one linear run of 16 bytes]

If the CPU's cache is only able to hold 4 bytes, then traversing the array in the [0][0], [1][0], [2][0], ... order will cause a cache miss on every access, requiring us to go to RAM (which is expensive) 16 times!

Traversing it in the [0][0], [0][1], [0][2], ... order lets the entire 2D array be accessed with only 4 cache misses.
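You can verify this flattening directly. Below is a minimal sketch (the 4x4 Matrix and the base pointer are just illustrative names) that prints which byte of the 16-byte block each element occupies:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint8_t Matrix[4][4];
        uint8_t *base = &Matrix[0][0];   /* the same 16 bytes, viewed linearly */
        int i, j;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++)
                /* Row-major layout: element [i][j] is byte i*4 + j. */
                printf("Matrix[%d][%d] -> byte %2ld\n",
                       i, j, (long)(&Matrix[i][j] - base));

        return 0;
    }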


This example is very simplified; modern systems will almost definitely have L1 and L2 caches, and many now implement an L3 cache too.

As you get further from the processor core, the memory gets larger and slower. For example:

  1. L1 cache (small, very, very fast)
  2. L2 cache
  3. L3 cache (if present)
  4. RAM
  5. Persistent storage (e.g. HDD: huge, but comparatively very, very slow)
Attie

It is related to locality of reference and to the CPU cache, so it is mostly processor-specific (and not much OS-specific).

A cache miss can be very costly: fetching data from DRAM modules typically takes on the order of a hundred nanoseconds (enough time to execute a hundred machine instructions from the L1 I-cache), while an access that hits the L1 cache takes only a nanosecond or a few.

Read also this and that. Sometimes (but not always) using __builtin_prefetch might improve performance, but in general the GCC compiler can optimize this better than you can, by emitting PREFETCH machine instructions where appropriate. Using __builtin_prefetch badly or too often will hurt performance.
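For illustration only, here is a sketch of what a manual prefetch could look like in the row-major loop from the question. The distance of 16 elements ahead is a guess, not a tuned value, and on a sequential walk like this the hardware prefetcher usually makes it redundant:

    #define N 2048

    int Matrix[N][N];

    void fill(void)
    {
        int i, j;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                /* Hint that we will soon write 16 elements ahead.
                   Arguments: address, rw (1 = write), locality (0..3).
                   The prefetch instruction itself never faults, so
                   running slightly past the end of a row is harmless. */
                __builtin_prefetch(&Matrix[i][j + 16], 1, 1);
                Matrix[i][j] = 9999;
            }
        }
    }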

Don't forget to enable optimizations in your compiler: compile with at least gcc -Wall -O2 -march=native before benchmarking (or even with -O3 instead of -O2 ...).

Basile Starynkevitch

It's all about the cache. In the first version, you're basically accessing the memory in one sequential sweep. In the second, you make a long jump between each access.

There are circuits in the computer that store nearby data whenever you access memory, because nearby data is likely to be needed soon. You cannot control how these circuits work; all you can do is adjust your code to their behavior.
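On Linux with glibc (such as the Ubuntu system in the question), you can at least query the granularity those circuits work at. A minimal sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* glibc-specific sysconf name; returns the L1 data cache
           line size in bytes, or 0 if the value is unknown. */
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 data cache line size: %ld bytes\n", line);
        return 0;
    }

On most current x86 machines this prints 64, which is exactly the size of the chunk that gets pulled in around every address you touch.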

klutt