130

I am doing some matrix multiplication benchmarking, as previously mentioned in Why is MATLAB so fast in matrix multiplication?

Now I've got another issue: when multiplying two 2048x2048 matrices, there is a big difference between C# and others. When I try to multiply only 2047x2047 matrices, it seems normal. I added some others for comparison too.

1024x1024 - 10 seconds
1027x1027 - 10 seconds
2047x2047 - 90 seconds
2048x2048 - 300 seconds
2049x2049 - 91 seconds (update)
2500x2500 - 166 seconds

That is a three-and-a-half-minute difference for the 2k by 2k case.

I'm using 2-dimensional arrays.

// Array init like this (matice1, matice2, matice3 are all created the same way)
int rozmer = 2048;
float[,] matice1 = new float[rozmer, rozmer];
float[,] matice2 = new float[rozmer, rozmer];
float[,] matice3 = new float[rozmer, rozmer];

// Main multiply code: naive triple loop, row of matice1 times column of matice2
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j, m] * matice2[m, k];
        }
        matice3[j, k] = temp;
    }
}
Wolf
  • This would be a great exam question for an advanced level C programming or OS Design class ;-) – Dana the Sane May 19 '11 at 16:04
  • Have you tried testing both multidimensional [,] and jagged [][] arrays as well as 32- and 64-bit? I only tested a few times, but jagged seemed more in line with your results, though jagged 64-bit were high; I don't know if there are any heuristics in the JIT that apply to this situation or if it's cache-related as previously suggested. If you want a GPGPU solution there is http://research.microsoft.com/en-us/projects/accelerator/ which should be competitive with the times in your other post. – Kris May 19 '11 at 17:34
  • Somewhat naive question, but how many ops (adding/multiplying) are involved in multiplying two square matrices? – Nick T May 19 '11 at 23:18
  • Same problem here: http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements?lq=1 http://stackoverflow.com/questions/7905760/matrix-multiplication-small-difference-in-matrix-size-large-difference-in-timi – phuclv Jun 01 '14 at 02:07

10 Answers

63

This probably has to do with conflicts in your L2 cache.

Cache misses on matice1 are not the problem, because it is accessed sequentially. For matice2, if a full column fits in L2 (i.e. when you access matice2[0, 0], matice2[1, 0], matice2[2, 0], etc., nothing gets evicted), then there is no problem with cache misses on matice2 either.

Now, to go deeper into how caches work: if the byte address of your variable is X, then the cache line index for it would be (X >> 6) & (L - 1), where L is the total number of cache lines in your cache. L is always a power of 2. The six comes from the fact that 2^6 == 64 bytes is the standard size of a cache line.

Now what does this mean? Well, it means that if I have address X and address Y, and (X >> 6) - (Y >> 6) is divisible by L (i.e. by some large power of 2), they will be stored in the same cache line.

Now, to go back to your problem: what is the difference between 2048 and 2049?

When 2048 is your size:

If you take &matice2[x, k] and &matice2[y, k], the difference (&matice2[x, k] >> 6) - (&matice2[y, k] >> 6) will be divisible by 2048 * 4 (the size of a float), so a large power of 2.

Thus, depending on the size of your L2, you will have a lot of cache line conflicts and only utilize a small portion of your L2 to store a column. You won't actually be able to store the full column in your cache, so you will get bad performance.

When the size is 2049, the difference is 2049 * 4, which is not a power of 2, so you will have far fewer conflicts and your column will safely fit into your cache.

Now, to test this theory, there are a couple of things you can do:

Allocate your matice2 array like this: matice2[rozmer, 4096], and run with rozmer = 1024, 1025 or any size. You should see very bad performance compared to what you had before, because you forcibly align all columns to conflict with each other.

Then try matice2[rozmer, 4097] and run it with any size; you should see much better performance.
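For concreteness, here is a minimal sketch of that experiment (the timing harness and the choice of rozmer are mine, not from the answer):

// Sketch of the suggested test: pad the inner dimension of matice2 so that
// consecutive rows stop mapping to the same cache sets.
// stride = 4096 forces the conflicts; stride = 4097 avoids them.
int rozmer = 1024;
int stride = 4097; // compare against 4096

float[,] matice1 = new float[rozmer, rozmer];
float[,] matice2 = new float[rozmer, stride]; // only the first rozmer columns are used
float[,] matice3 = new float[rozmer, rozmer];

var watch = System.Diagnostics.Stopwatch.StartNew();
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp += matice1[j, m] * matice2[m, k];
        }
        matice3[j, k] = temp;
    }
}
System.Console.WriteLine(watch.ElapsedMilliseconds + " ms");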

zviadm
20

Probably a caching effect. With matrix dimensions that are large powers of two, and a cache size that is also a power of two, you can end up only using a small fraction of your L1 cache, slowing things down a lot. Naive matrix multiplication is usually constrained by the need to fetch data into the cache. Optimized algorithms using tiling (or cache-oblivious algorithms) focus on making better use of L1 cache.

If you time other pairs (2^n-1,2^n) I expect you'll see similar effects.

To explain more fully: in the inner loop, where you access matice2[m,k], it's likely that matice2[m,k] and matice2[m+1,k] are offset from each other by 2048*sizeof(float) and thus map to the same index in the L1 cache. With an N-way associative cache, you will typically have 1-8 cache locations for all of these. Thus almost all of those accesses will trigger an L1 cache eviction and a fetch of data from a slower cache or main memory.
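To make the tiling idea above concrete, here is a rough sketch of a blocked multiply (the block size is an assumption to be tuned per machine, not something taken from this answer):

// Rough sketch of a tiled (blocked) multiply. c must be zeroed beforehand;
// each (jj, kk, mm) block adds its partial sums into c.
static void MultiplyTiled(float[,] a, float[,] b, float[,] c, int n)
{
    const int blockSize = 64; // assumed tile size; tune so three tiles fit in cache
    for (int jj = 0; jj < n; jj += blockSize)
    for (int kk = 0; kk < n; kk += blockSize)
    for (int mm = 0; mm < n; mm += blockSize)
    {
        int jMax = System.Math.Min(jj + blockSize, n);
        int kMax = System.Math.Min(kk + blockSize, n);
        int mMax = System.Math.Min(mm + blockSize, n);
        for (int j = jj; j < jMax; j++)
        for (int k = kk; k < kMax; k++)
        {
            float temp = c[j, k];
            for (int m = mm; m < mMax; m++)
                temp += a[j, m] * b[m, k];
            c[j, k] = temp;
        }
    }
}

Each tile of a, b and c is reused many times while it is still resident in cache, instead of streaming the whole of b through the cache once per output row.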

Jonathan Moore
15

This may have to do with the size of your CPU cache. If 2 rows of the matrix do not fit, then you will lose time swapping elements in from RAM. The extra 4095 elements may be just enough to prevent rows from fitting.

In your case, 2 rows of a 2047x2047 matrix fall within 16 KB of memory (assuming 32-bit types). For example, if you have an L1 cache (closest to the CPU on the bus) of 64 KB, then you can fit at least 4 rows (of 2047 × 32 bits) into the cache at once. With the longer rows, if any padding is required that pushes a pair of rows beyond 16 KB, then things start to get messy. Also, each time you 'miss' the cache, swapping in data from another cache or main memory delays things.

My guess is that the variance in run times you're seeing with the different-sized matrices is affected by how effectively the operating system can make use of the available cache (and some combinations are just problematic). Of course, this is all a gross simplification on my part.

Dana the Sane
  • but it is very unlikely he has 16.7 MB of CPU cache – Marino Šimić May 19 '11 at 15:33
  • I updated the results with 2049x2049 - 91 seconds. If it was a "cache problem", shouldn't this still be 300+ s? – Wolf May 19 '11 at 15:35
  • @Marino the answer has been updated to take that into account. – Dana the Sane May 19 '11 at 15:36
  • @Marino Šimić: 2048*2048 is exactly 16 MiB. – Guffa May 19 '11 at 15:37
  • @Guffa: sure, I didn't divide by 1024x1024, but that should not change the fact that finding 16 MB cache CPUs is still hard :) – Marino Šimić May 19 '11 at 15:43
  • I feel like none of these explanations can adequately address the new details regarding the various and sparse sizes that elicit the issue, with others in between being unaffected. – kqnr May 19 '11 at 15:50
  • I don't think we can do more than scratch the surface unless an expert on memory subsystems happens to wander by. I think these answers narrow things down pretty well though. – Dana the Sane May 19 '11 at 15:58
  • @Marino Šimić: The cache doesn't have to be exactly the size of one of the matrices; it makes enough of a difference whether half of a matrix fits in the cache or slightly less than half. As the matrices are accessed both vertically and horizontally, they will be swapped in and out of the cache a lot of times. Besides, there are three matrices, so you would need 48 MiB to fit all of them in the cache. – Guffa May 19 '11 at 16:35
  • I don't think this explanation is correct. The problem lies in not utilizing cache capacity fully, due to cache line conflicts when the size is a power of 2. Also, the operating system really has nothing to do with caches, because it is not the OS that decides what to cache and what to evict; it is all in hardware. The OS has something to do with data alignment, but in this case it is all about how C# decides to allocate data and how to represent a 2D array in memory; the OS has nothing to do with it. – zviadm May 19 '11 at 19:00
  • @zvio: I agree. Downvoting to try to get the more correct answers to the top. – Macke May 19 '11 at 19:13
  • @Macke feel free to make edits to *any* of the answers if you feel they could be improved. Better answers are more effective than hiding problems under the carpet ; ) This answer is now community wiki. – Dana the Sane May 19 '11 at 19:21
  • @Dana: I've upvoted the ones that actually mention cache associativity. – Macke May 19 '11 at 19:44
11

Louis Brandy wrote two blog posts analyzing exactly this issue:

More Cache Craziness and Computational Performance - A beginner's case study, with some interesting statistics and attempts to explain the behavior in more detail; it does indeed come down to cache size limitations.

Christian Hang-Hicks
5

Given that the time is dropping at larger sizes, wouldn't it be more likely to be cache conflicts, especially with powers of 2 for the problematic matrix sizes? I am no expert on caching issues, but there is excellent info on cache-related performance issues here.

4

As you are accessing the matice2 array vertically, it will be swapped in and out of the cache a lot more. If you mirror the array diagonally (i.e. transpose it), so that you can access it using [k,m] instead of [m,k], the code will run a lot faster.

I tested this for 1024x1024 matrices, and it is about twice as fast. For 2048x2048 matrices it's about ten times faster.
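A minimal sketch of that transformation, reusing the names from the question (the extra transpose pass is O(n^2) work, which is negligible next to the O(n^3) multiply):

// Transpose matice2 once so the inner loop reads it row-wise
// (sequentially in memory) instead of column-wise.
float[,] matice2T = new float[rozmer, rozmer];
for (int m = 0; m < rozmer; m++)
    for (int k = 0; k < rozmer; k++)
        matice2T[k, m] = matice2[m, k];

for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp += matice1[j, m] * matice2T[k, m]; // both operands now stream sequentially
        }
        matice3[j, k] = temp;
    }
}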

Guffa
  • This doesn't explain why 2049 is faster than 2048. – Macke May 19 '11 at 19:14
  • @Macke: That is because it passes some limit in the memory caching, so that there are a lot more cache misses. – Guffa May 19 '11 at 20:03
  • Why the downvote? If you don't say what you think is wrong, it can't improve the answer. – Guffa May 21 '11 at 18:23
  • Another downvote without any explanation... Is it that my answer has too few "probably", "guess" and "should" in it, like the answers that get the most upvotes...? – Guffa May 22 '11 at 08:10
4

Cache Aliasing

Or cache thrashing, if I can coin a term.

Caches work by indexing with low-order bits and tagging with high-order bits.

Imagine that your cache has 4 words and that your matrix is 4 x 4. When a column is accessed and the row is any power of two in length, each column element in memory will map to the same cache element.

A power-of-two-plus-one is actually about optimal for this problem: each new column element maps to the next cache slot exactly as if accessing by row.

In real life, a tag covers multiple sequentially increasing addresses, which will cache several adjacent elements in a row. By offsetting the bucket that each new row maps to, traversing the column doesn't replace the previous entry. When the next column is traversed, the entire cache will be filled with different rows, and each row section that fits into the cache will hit for several columns.

Since the cache is vastly faster than DRAM (mostly by virtue of being on-chip), hit rate is everything.
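Here is a small sketch of this effect, with assumed cache geometry (64-byte lines, 512 sets, i.e. a 32 KB direct-mapped cache; real caches differ and are set-associative):

using System;
using System.Collections.Generic;

class CacheSetDemo
{
    static void Main()
    {
        const int lineSize = 64;  // assumed cache line size in bytes
        const int numSets = 512;  // assumed number of sets

        foreach (int n in new[] { 2048, 2049 })
        {
            var sets = new HashSet<long>();
            long rowStride = n * sizeof(float); // bytes between vertical neighbours
            for (long m = 0; m < n; m++)
                sets.Add((m * rowStride / lineSize) % numSets);
            Console.WriteLine(n + "x" + n + ": one column touches " + sets.Count + " of " + numSets + " sets");
        }
        // A 2048-float row stride lands every column element in one of just 4 sets;
        // with 2049 the mapping drifts and the column eventually uses all 512 sets.
    }
}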

DigitalRoss
2

You appear to have hit a cache size limit, or perhaps have some repeatability problems in your timings.

Whatever the issue is, you simply should not write matrix multiplication yourself in C#; use an optimized version of BLAS instead. Matrices of that size should be multiplied in under a second on any modern machine.
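For illustration only (the answer doesn't name a library; this assumes the Math.NET Numerics package, which can delegate to an optimized provider):

// Illustrative sketch - one of several .NET wrappers over optimized
// linear algebra routines, not something named in the answer.
using MathNet.Numerics.LinearAlgebra;

class BlasDemo
{
    static void Main()
    {
        var a = Matrix<float>.Build.Random(2048, 2048);
        var b = Matrix<float>.Build.Random(2048, 2048);
        var c = a * b; // dispatched to the configured provider
        System.Console.WriteLine(c[0, 0]);
    }
}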

David Heffernan
  • I am aware of BLAS, but the task was not to make it as fast as possible, but to write and test it in various languages. This is a very strange problem for me and I am really curious why the results are like they are. – Wolf May 19 '11 at 15:49
  • @Wolf I'd find it hard to get excited about whether something that should take a second is taking 90 seconds or 300 seconds. – David Heffernan May 19 '11 at 15:51
  • The best way to learn how something works is to write it yourself and see how you can improve your implementation; this is (hopefully) what Wolf is doing. – Callum Rogers May 19 '11 at 16:22
  • @Callum Rogers, agreed. That is how I learned the importance of buffer sizes in file copy operations. – Kelly S. French May 19 '11 at 21:39
1

Effectively utilizing the cache hierarchy is very important. You need to make sure that multidimensional arrays have data in a nice arrangement, which can be accomplished by tiling. To do this you'll need to store the 2D array as a 1D array together with an indexing mechanism. The problem with the traditional method is that although two adjacent array elements that are in the same row are next to each other in memory, two adjacent elements in the same column will be separated by W elements in memory, where W is the number of columns. Tiling can make as much as a factor-of-ten performance difference.
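A minimal sketch of the flat-storage idea (the names here are mine, for illustration):

// Store an n x n matrix in one flat array: element (j, m) lives at
// index j * n + m. Row neighbours are adjacent in memory, while column
// neighbours are n floats (4 * n bytes) apart - the gap that tiling works around.
static float Get(float[] matrix, int n, int j, int m)
{
    return matrix[j * n + m];
}

// Usage: float[] a = new float[n * n]; float x = Get(a, n, j, m);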

Arlen
  • Hmm - yet an array declared as 2D (float[,] matice = new float[rozmer, rozmer];) is only ever allocated in RAM as a one-dimensional array, with row/stride calculations done under the hood. So why would declaring it as 1D and doing manual row/stride calculations be faster? Do you mean the solution is to allocate the big array as an array of smaller tiles, each of which can fit into cache where the big array would not? – Eric M May 19 '11 at 16:31
  • If your library or whatever tool you are using does tiling, then you don't need to. But if you were to use a traditional 2D array in, say, C/C++, then tiling would improve performance. – Arlen May 19 '11 at 16:35
-1

I suspect it is the result of something called "sequential flooding": you are trying to loop through a list of objects that is slightly larger than the cache size, so every single request to the list (array) must be served from RAM, and you will not get a single cache hit.

In your case, you are looping through your arrays' 2048 indexes 2048 times, but you only have space for 2047 (possibly due to some overhead from the array structure), so each time you access an array position, it needs to be fetched from RAM. It is then stored in the cache, but right before it is used again, it is evicted. So the cache is essentially useless, leading to a much longer execution time.

Automatico
  • Incorrect. 2049 is faster than 2048, which refutes your claim. – Macke May 19 '11 at 19:13
  • @Macke: That is quite possible. But there is a _slight_ chance that the cache policy used in his processor might still make this decision. It is not very likely, but it is not unthinkable. – Automatico May 19 '11 at 21:59