
I have measured the time spent by Fortran's MATMUL function for different matrix sizes (32 × 32, 64 × 64, ...) and I have some questions about the results.

These are the results:

SIZE ----- TIME IN SECONDS
32   -----   0.000071
64   -----   0.000032
128  -----   0.001889
256  -----   0.010866
512  -----   0.043
1024 -----   0.336
2048 -----   2.878
4096 -----  51.932
8192 ----- 405.921856

I expected the times to increase by a factor of 8 each time the size doubles, since the amount of work scales as (2m) * (2n) * (2k) = 8 * m * n * k. I am not sure whether that expectation is correct. If it is, can someone explain why the measurements do not follow it?

In addition, there is an increase by a factor of about 18 when going from 2048 to 4096. Could someone explain why?

I have measured the times both with CALL CPU_TIME() and with CALL DATE_AND_TIME() in Fortran, and both give very similar results.
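For reference, this is a simplified sketch of the kind of timing loop I am using (not my exact code; the sizes and the random initialisation here are just for illustration):

```fortran
program time_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: sizes(5) = [256, 512, 1024, 2048, 4096]
  integer :: i, n
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  real(dp) :: t0, t1

  do i = 1, size(sizes)
     n = sizes(i)
     allocate (a(n,n), b(n,n), c(n,n))
     call random_number(a)
     call random_number(b)

     call cpu_time(t0)
     c = matmul(a, b)
     call cpu_time(t1)

     ! Print something that depends on c so the compiler
     ! cannot optimise the multiplication away.
     print '(a, i6, a, f12.6, a, es12.4)', &
          'n =', n, '  time =', t1 - t0, '  check =', c(1,1)

     deallocate (a, b, c)
  end do
end program time_matmul
```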

My processor is an AMD Phenom II X4 945 with 4 cores.

Lane
  • Compiler name, version and compile options? Datatype of array? Memory size? You may be running into memory traffic delays due to insufficient cache. Some compilers have an option to use a specially optimized MATMUL that can help (especially with multiple cores). Your question, as stated, omits lots of factors that can make a difference. – Steve Lionel Jun 30 '19 at 13:41
  • This is mildly related: [What is cache friendly code](https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code) – kvantour Jul 01 '19 at 08:43
  • Another interesting answer here: [How to optimize matrix multiplication on a single core](https://stackoverflow.com/a/54546544/8344060) – kvantour Jul 01 '19 at 09:35

2 Answers


@Steve is correct: there are many factors that affect performance, especially when the data sizes are small. That is why all of your results at and below 2048 are fairly random and essentially irrelevant. All or most of the data likely sits in the various levels of CPU cache, so context switches and other hardware-related events skew those results heavily. If you run the tests again you will get different numbers at these small sizes.

So when you go from 2048 to 4096 you get a major jump: the data no longer fits in the CPU caches, and the computer has to keep loading blocks of data from RAM. This explains the large jump in time.
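To put a rough number on that (assuming double-precision elements, which the question does not state): a 4096 × 4096 matrix occupies 4096 × 4096 × 8 bytes ≈ 134 MB, so the three matrices together need roughly 400 MB, far beyond the roughly 6 MB of L3 cache on a Phenom II X4 945.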

It is at these sizes and larger that the computer has to go through the more typical sequence of operations (load data from RAM, compute, store results back to RAM), and this is the performance you will get as the data grows even larger. This is also where performance becomes very consistent. Notice that going from 4096 to 8192 takes very close to exactly 8 times longer. From this point on, going to 16384 should take almost exactly 8 times 406 seconds.
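As a quick check against the question's own numbers: 405.92 / 51.93 ≈ 7.8, already close to the ideal factor of 8, and 8 × 406 s ≈ 3250 s, i.e. a bit under an hour for 16384.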

Any size smaller than 4096 is not giving your computer enough work to accurately measure the performance.

Dan Sp.

There should indeed be a factor of 8 between consecutive timings; deviations are generally due to memory effects such as cache alignment and the relation between cache size and array size. For small arrays there may also be some overhead in calling matmul(). A plain triple do-loop can be faster, at least with some optimization (try -O3 -march=native), and should work equally well at small sizes.
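For example, a sketch of such a triple do-loop (just an illustration of the loop ordering, not a tuned implementation; the j-k-i order keeps the innermost loop running down columns, which is the contiguous direction in Fortran):

```fortran
! Naive matrix product c = a * b for n x n double-precision matrices.
subroutine my_matmul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer :: i, j, k

  c = 0.0d0
  do j = 1, n
     do k = 1, n
        do i = 1, n
           ! Innermost loop runs over the first index: unit stride in a and c.
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do
end subroutine my_matmul
```

Compile with full optimization (e.g. gfortran -O3 -march=native) so the inner loop can be vectorized.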

Jonatan Öström