
My application does some operations on large matrices. I recently came across the concept of CPU caches and the performance effect they can have through this answer. I would like to know which of the two algorithms below is more cache friendly for my case.

Algorithm 1:
//K is a constant double variable; c is a 2-dimensional array of doubles
for(int i = 0; i < size; i++)
{
    for(int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j];
    }
}

Algorithm 2:
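//copy the diagonal of c into its own contiguous array before the main loop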
double *A = new double[size];
for(int n = 0; n < size; n++)
    A[n] = c[n][n];

for(int i = 0; i < size; i++)
{
    for(int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}

The size of my array is more than 1000x1000. Benchmarking on my laptop shows Algorithm 2 is better than Algorithm 1 for a size of 5000x5000. Please note that I have multi-threaded my application so that each thread operates on a contiguous set of rows (a rough sketch of this setup follows the example below).

For example, for an array of size 1000x1000:
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
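
As a rough illustration of that partitioning, here is a minimal sketch using std::thread. The question does not say which threading API is used, c is assumed here to be an array of row pointers, and the names processRows/runThreaded are only illustrative.

#include <thread>
#include <vector>

// Sketch only: each thread applies the Algorithm 2 update to its own range of rows.
void processRows(double** c, const double* A, double K, int size,
                 int rowBegin, int rowEnd)
{
    for (int i = rowBegin; i < rowEnd; i++)
        for (int j = i + 1; j < size; j++)
            c[i][j] -= K * A[j];        // same inner update as Algorithm 2
}

void runThreaded(double** c, const double* A, double K, int size, int numThreads)
{
    std::vector<std::thread> workers;
    int rowsPerThread = size / numThreads;              // e.g. 1000 / 4 = 250 rows each
    for (int t = 0; t < numThreads; t++)
    {
        int begin = t * rowsPerThread;
        int end   = (t == numThreads - 1) ? size : begin + rowsPerThread;
        workers.emplace_back(processRows, c, A, K, size, begin, end);
    }
    for (std::thread& w : workers)
        w.join();                                       // wait for every row range to finish
}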
Cool_Coder

2 Answers


If your benchmarks show a significant improvement for the second case, then it most likely is the better choice. But of course, to know how it behaves for "an average CPU", we'd have to measure it on a large number of CPUs that can be called average - there is no other way. And it really depends on the definition of an average CPU: are we talking about "any x86 (AMD + Intel) CPU" or "any random CPU we can find in anything from a watch to the latest super-fast creation in the x86 range"?

The "copy the data in c[n][n]" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j], every j steps will jump sizeof(double) * (size * j + 1) bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.

In other words, for anything that has a decent-sized cache (bigger than size * sizeof(double)), it's a definite benefit. For size = 5000, that copy is about 40 KB, which overflows a typical 32 KB L1 but fits easily in L2. Even with a smaller cache there is quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j].

In summary, the second algorithm is very likely the better one on nearly all hardware.

Mats Petersson
  • Hey Mats, Thanks for answering! Can you please tell me how you could understand what will stay in cache & what will be thrown out, just by looking at the code? What is this concept? I would like to learn it. I could not understand some terms like "jumping", "cache line", "gets its own address". Any online information on this topic would be very helpful... – Cool_Coder Sep 21 '13 at 09:50
  • A "cache-line" is a unit of data in the cache - a cache will not fetch one byte at a time, but several bytes (32 or 64, typically). Jumping means that the address is not staying within/near the cache-line. Caches are organised based on the lower address of the data, so if you have multiple data items with the same lower address bits, they will compete for the same (2 or 4) entries in the cache. Having it's own address means that it's not competing quite so much with the rest of the data. – Mats Petersson Sep 21 '13 at 11:43
  • Assuming 64B lines and an 8-way 32k L1 cache, for example, the competition is on addr bits 6:11, meaning that addr 0x1000 and addr 0x2000 would compete for the same set, and if you also use 0x3000 ... 0x9000 one of these lines will have to be evicted. Of course it's more complicated as you have lower (and bigger) cache levels, but you can safely assume that a one-time access in a linear stream will be replaced by the time you wrap around to it at the sizes you're using. However, repeated accesses would train the cache to preserve the line (replaced lines would usually be the least recently used ones). – Leeor Sep 21 '13 at 11:58

Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a single-dimension array makes it reside in consecutive memory addresses, and thereby it:

  1. Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, depending on your CPU), better utilizing the cache and memory bandwidth (whereas fetching c[n][n] directly also pulls in a lot of useless data, since the rest of each of those lines is never used).

  2. Enjoys the benefit of HW stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data in advance into the lower cache levels, improving the memory latency.

It should be pointed out that moving the data to A doesn't necessarily improve cacheability, since A would still compete against a lot of data constantly coming from c and thrashing the cache. However, since it's used over and over, there's a high chance that a good LRU algorithm would keep it in the cache anyway. You could help that by using streaming memory operations for array c. Note that these are very volatile performance tools and may in some scenarios lead to a performance reduction if not used correctly.
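
The answer mentions streaming memory operations without showing code; below is a rough, heavily hedged sketch of one way to do that with SSE2 non-temporal stores (_mm_stream_pd). The function name update_row_stream and the alignment handling are illustrative, not from the original code, and because c[i][j] is also read here the benefit is situational, as the answer warns.

#include <immintrin.h>
#include <cstdint>

// Sketch only: write one row of c with non-temporal stores so its lines do not
// push A out of the cache. Assumes rows of c are at least 8-byte aligned
// (true for memory obtained from new double[]).
void update_row_stream(double* row, const double* A, double K, int i, int size)
{
    const __m128d k = _mm_set1_pd(K);
    int j = i + 1;

    // Peel one element if needed so the vector loop starts on a 16-byte
    // boundary, as _mm_stream_pd (and _mm_load_pd) require.
    if (j < size && reinterpret_cast<std::uintptr_t>(row + j) % 16 != 0)
    {
        row[j] -= K * A[j];
        ++j;
    }
    for (; j + 1 < size; j += 2)
    {
        __m128d cj = _mm_load_pd(row + j);          // read the current values
        __m128d aj = _mm_loadu_pd(A + j);           // copied diagonal (may be unaligned)
        cj = _mm_sub_pd(cj, _mm_mul_pd(k, aj));
        _mm_stream_pd(row + j, cj);                 // non-temporal store: bypass the cache
    }
    if (j < size)                                   // scalar tail
        row[j] -= K * A[j];

    _mm_sfence();   // make the streaming stores visible before the data is reused
}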

Another potential benefit could come from mixing in SW prefetches slightly ahead of reaching every new array line.
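
For the software-prefetch idea, a minimal sketch using the _mm_prefetch intrinsic follows; the prefetch distance (the first element of the next row) is an illustrative choice, not something given in the answer.

#include <immintrin.h>

// Sketch only: hint the cache about the start of the next row while the
// current row is being updated (c, A, K, size as in Algorithm 2).
void update_with_prefetch(double** c, const double* A, double K, int size)
{
    for (int i = 0; i < size; i++)
    {
        if (i + 2 < size)   // first element touched in row i+1 is c[i+1][i+2]
            _mm_prefetch(reinterpret_cast<const char*>(&c[i + 1][i + 2]), _MM_HINT_T0);

        for (int j = i + 1; j < size; j++)
            c[i][j] -= K * A[j];
    }
}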

Leeor
  • Thanks for adding your inputs! Can you please tell me what you mean by "streaming memory operations for array c" & the next line after that? – Cool_Coder Sep 21 '13 at 09:59
  • see here for e.g. - http://blogs.fau.de/hager/2008/09/04/a-case-for-the-non-temporal-store/ – Leeor Sep 21 '13 at 11:31