
I am investigating how cache misses influence the speed of computation. I know there are better algorithms for multiplying two matrices (even a simple exchange of two of the loops below would help), but please consider this code:

float a[N][N];
float b[N][N];
float c[N][N];
// ...
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0;
            for (int k = 0; k < N; k++) {
                sum = sum + a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}
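
For reference, the loop exchange mentioned above would look like this - a sketch that computes the same c, with the j and k loops swapped so that the innermost loop walks b and c row by row:

// Sketch of the loop interchange: the innermost loop now reads b and
// writes c sequentially instead of walking down a column of b.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++)
        c[i][j] = 0.0f;
    for (int k = 0; k < N; k++) {
        float aik = a[i][k];          // reused across a whole row of b
        for (int j = 0; j < N; j++)
            c[i][j] += aik * b[k][j]; // sequential reads and writes
    }
}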

I have recompiled this code for many values of N and measured the time it takes to run. I expected to see a sudden increase in time at around N=1250, at which point matrix c no longer fits in the cache (the size of c is then 1250*1250*sizeof(float) = 6,250,000 bytes, or roughly 6 MB, which is the size of my L3 cache).
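
As a quick sanity check of that threshold (a sketch, taking 6 MB literally and assuming 4-byte floats):

#include <math.h>
#include <stdio.h>

int main(void) {
    double cache_bytes = 6.0 * 1024 * 1024;                   // 6 MB L3
    printf("N ~ %.0f\n", sqrt(cache_bytes / sizeof(float)));  // prints ~1254
    return 0;
}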

Indeed, the general trend is that past that point, the average time is roughly triple the time extrapolated from the earlier trend. But the value of N%8 seems to have a huge influence on the result. For example:

1601 - 11.237548
1602 - 7.679103
1603 - 12.216982
1604 - 6.283644
1605 - 11.360517
1606 - 7.486021
1607 - 11.292025
1608 - 5.794537
1609 - 11.469469
1610 - 7.581660
1611 - 11.367203
1612 - 6.126014
1613 - 11.730543
1614 - 7.632121
1615 - 11.773091
1616 - 5.778463
1617 - 11.556687
1618 - 7.682941
1619 - 11.576068
1620 - 6.273122
1621 - 11.635411
1622 - 7.804220
1623 - 12.053517
1624 - 6.008985
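
Such measurements can be reproduced with a harness along these lines (a sketch; the clock_gettime timing and static arrays are assumptions, not the exact code used for the numbers above):

#include <stdio.h>
#include <time.h>

#define N 1601  /* rebuilt and rerun for each measured N */

static float a[N][N], b[N][N], c[N][N];

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%d - %f\n", N,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    printf("%f\n", c[0][0]);  /* keep the result live so -O2 cannot drop the work */
    return 0;
}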

For some time, I thought those might be alignment issues - the rows of each matrix are aligned to 32 bytes when N%8==0 (first question - why 32 bytes in particular? SSE instructions, such as movaps, can work on 16-byte-aligned data).
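
Spelling that alignment arithmetic out (a sketch, assuming 4-byte floats and a 32-byte-aligned base address):

/* Row i of float[N][N] starts i * N * sizeof(float) bytes past the base,
 * so every row offset is a multiple of 32 exactly when N * 4 is,
 * i.e. when N % 8 == 0. */
int rows_32B_aligned(int N) {
    return (N * (int)sizeof(float)) % 32 == 0;  /* true iff N % 8 == 0 */
}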

Another idea was that this could be somehow connected to cache associativity (8-way for L1 and L2 and 12-way for L3 on my machine).

But then I noticed that for some values of N, such as 1536, there are unexpected spikes, even though the alignment should be excellent in those cases (1536 == 256*6) and the associativity should be a non-issue too (1536 == 128*12 == 192*8). For example:

1504 - 4.644781
1512 - 4.794254
1520 - 4.768555
1528 - 4.884714
1536 - 7.949040
1544 - 5.162613
1552 - 5.083331
1560 - 5.388706
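
One way to probe the associativity idea is to compute which cache set successive elements of a column of b map to (a sketch, assuming a 256 kB, 8-way L2 with 64-byte lines, i.e. 512 sets):

#include <stdio.h>

int main(void) {
    const long line = 64;
    const long sets = 256 * 1024 / (64 * 8);  /* 512 sets */
    const long N = 1536, stride = N * 4;      /* bytes between b[k][j] and b[k+1][j] */
    for (long k = 0; k < 20; k++)
        printf("k=%2ld -> set %ld\n", k, (k * stride / line) % sets);
    /* For N = 1536 the set index repeats with period 16, so a whole
     * column competes for only 16 sets * 8 ways = 128 cache lines. */
    return 0;
}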

The timings are pretty consistent across runs, so spikes in processor load are not the problem. I compile the code with optimizations turned on (-O2). My ideas are unfortunately running out. What could be the reason for such behaviour?

    The large spikes at powers-of-two and small multiples of powers-of-two are likely due to: http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements – Mysticial Dec 20 '16 at 20:12
  • Are you compiling for AVX? That will make things matter at 32-byte alignment. – Mysticial Dec 20 '16 at 20:13
  • I'm using GCC's defaults, and from the assembly dump it seems AVX is unused. Thanks for the link though, I'll read it. – akrasuski1 Dec 20 '16 at 20:22
  • Thanks, that seems to be the issue. The L2 size is 32kB on my computer, and a cache line is 64B. There should be 32kB/64B = 512 cache lines available. Shouldn't the whole cache be useless after we iterate over the column then? 1536 lines would be needed to remember it. Unless the computer uses something smarter than LRU, of course... – akrasuski1 Dec 20 '16 at 21:42
  • Ah, nevermind, I looked up L1. L2 is 256kB, so the problem should manifest very clearly (256kB is about five times more than enough to hold a single column). Again, thanks for helping. The only mysterious thing left to explain is why `N%8` has such a big impact. – akrasuski1 Dec 20 '16 at 23:43

1 Answer


The most important factor in your example is the CPU cache line size, which is typically 64 bytes. Even when your program reads or writes a single byte, the CPU reads or writes the whole line (64 bytes). That is why your performance is good when the accesses hit cache lines, and why every miss adds the overhead of a full memory read/write. The size of the L3 cache is not as critical.
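
To see the cache-line effect in isolation, one can traverse the same matrix by rows and by columns and compare the times - a minimal sketch:

#include <stdio.h>
#include <time.h>

#define M 4096
static float m[M][M];

static double walk(int by_rows) {
    struct timespec t0, t1;
    volatile float sink = 0.0f;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            sink += by_rows ? m[i][j] : m[j][i];  /* row-wise: new line every 16 floats;
                                                     column-wise: new line every access */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("rows: %f s, columns: %f s\n", walk(1), walk(0));
    return 0;
}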

For your code:

// All your stack variables are fine; the compiler will handle them well.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        float sum = 0.0;
        for (int k = 0; k < N; k++) {
            sum = sum +
                  a[i][k] *  // good: you read this memory sequentially
                  b[k][j];   // bad: every read comes from a different cache line
        }
        c[i][j] = sum;       // doesn't matter much: this write happens once per N reads
    }
}
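
One standard fix along those lines is to transpose b once, so that the hot inner loop reads both operands sequentially (a sketch, not taken from the presentation linked below):

// Transpose b once (O(N^2) work) so the O(N^3) inner loop reads both
// operands from consecutive cache lines.
static float bt[N][N];

for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
        bt[j][k] = b[k][j];

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += a[i][k] * bt[j][k];  // sequential reads of both a and bt
        c[i][j] = sum;
    }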

A case similar to yours is discussed here. This presentation explains well how to optimize such code and why it works this way. I hope you will find everything you need.


BayK