
I have 2 C sequences which both multiply two matrices.

Sequence 1:

int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++) 
    for (j = 0; j < P; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

Sequence 2:

int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--) 
    for (j = P - 1; j >= 0; j--)
        for (k = N - 1; k >= 0; k--)
            C[i][j] += A[i][k] * B[k][j];

My question is: which of them is more efficient when translated into assembly language? I'm pretty sure that the second one can be written using the `loop` instruction, while the first one can be written using `inc`/`jl`.

  • Compile it and look at the generated assembly code. – llllllllll Jan 27 '18 at 18:06
  • It depends a great deal on the target assembly language. – Sergey Kalinichenko Jan 27 '18 at 18:08
  • If you mean the x86 `loop` instruction, then you really don't want the compiler to use that one: https://stackoverflow.com/q/35742570/4271923. Anyway, as always with performance hunting, you should first prune your algorithm, optimize your data structures, and make sure your code is correct. Then you should measure its performance; that will be your baseline. Then you can either analyse the profiling results to see if there is an obvious bottleneck, or you can profile another version to see if it does better. Both variants in your question look reasonable for what they are, but... – Ped7g Jan 27 '18 at 18:18
  • For example: if M, N, P can be fixed and a power of two, and you provide that information to the compiler, it will certainly create something measurably faster than your original code compiled for general M, N, P values. You can often gain a lot more by first constraining and pruning the task enough to make it a perfect fit for the target architecture. At the single-instruction level the gains are not that interesting; a common modern C or C++ compiler will usually produce something reasonable for reasonable source. A couple of percent either way is probably still possible, but not as much as you can gain above. – Ped7g Jan 27 '18 at 18:20
  • Generally there is a slight case for a loop counter working downward: when going up, there has to be a specific comparison for the terminating condition, but when going down, no comparison is needed, just a flags test (assuming the instruction used does set the flags). – Weather Vane Jan 27 '18 at 18:23
  • Anyway, for assembly the structure of a loop like `do { ... } while (--counter);`, with the assumption that `counter > 0` ahead of the loop, is the "perfect fit", although the compiler at -O3 will probably still generate a lot more code than you would expect, as it would probably unroll the loop partially. Your `for` requires a condition check upon loop entry, which doesn't land in the assembly code flow as nicely as the `do {} while` form. Another common idiom is to flip the counter sign, going from `-P` toward zero and indexing by `C[i+P]`, but that's not used by C/C++ compilers due to some corner cases. – Ped7g Jan 27 '18 at 18:29
  • @WeatherVane You can use a negative loop counter which is incremented until it is zero. – fuz Jan 27 '18 at 18:31
  • I've seen gcc produce *significantly* worse code for downward-counting loop indices, because it didn't vectorize the code. You'll have to make sure to check the generated assembly. – EOF Jan 27 '18 at 18:43

1 Answer


First, you should understand that the source code does not dictate what assembly code the compiler generates. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)

Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
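
If you want to check this for your own compiler and options, one way (assuming GCC or Clang) is to put both loop orders into separate functions in one file and compare the assembly generated for each, for example with gcc -O2 -S. This is only a sketch; the fixed sizes are assumptions added so the file compiles on its own:

#define M 64
#define N 64
#define P 64

/* Upward-counting loops, as in Sequence 1. */
void mul_up(int A[M][N], int B[N][P], int C[M][P])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < P; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Downward-counting loops, as in Sequence 2. */
void mul_down(int A[M][N], int B[N][P], int C[M][P])
{
    for (int i = M - 1; i >= 0; i--)
        for (int j = P - 1; j >= 0; j--)
            for (int k = N - 1; k >= 0; k--)
                C[i][j] += A[i][k] * B[k][j];
}

With optimization enabled, a good compiler is free to generate the same kind of loop for both, regardless of the direction written in the source.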

If the arrays in your example are large, then the time it takes the processor to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory. The CPU will be forced to wait for this data, and it will easily complete any loop control or array address arithmetic instructions while it is waiting for data from memory.

Typical systems deal with the slow memory problem by including some fast memory, called cache. Often, there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there may be other levels of cache. Memory in cache is organized into lines, which are segments of consecutive data from memory. Thus, one cache line may contain eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid most of the memory delay by using eight consecutive int objects. When you read the first one (or even before, since the processor may predict your read and start fetching it ahead of time), all eight will be read from memory. So your program will only have to wait for the first one. When it goes to use the second through the eighth, they will already be in cache, where they are immediately available to the processor.
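
As a rough illustration of how much the access pattern matters, consider summing the same large two-dimensional array by rows and by columns. This is only a sketch, and SIZE is an assumption chosen so the array is much larger than typical caches; the row version touches consecutive ints within each cache line, while the column version touches a new line on nearly every access:

#include <stddef.h>

#define SIZE 4096

static int data[SIZE][SIZE];

long sum_by_rows(void)
{
    long s = 0;
    for (size_t i = 0; i < SIZE; i++)
        for (size_t j = 0; j < SIZE; j++)
            s += data[i][j];    /* stride of one int: consecutive elements share a cache line */
    return s;
}

long sum_by_columns(void)
{
    long s = 0;
    for (size_t j = 0; j < SIZE; j++)
        for (size_t i = 0; i < SIZE; i++)
            s += data[i][j];    /* stride of SIZE ints: a different cache line almost every time */
    return s;
}

On typical hardware the row version is several times faster, even though both perform exactly the same additions.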

Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
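
One common way to soften this, without changing the result, is to interchange the j and k loops so that the innermost loop walks along a row of B (and along a row of C) while a single element of A is reused. This is a sketch of that rearrangement, not code from the question; the sizes are assumptions, and it relies on C being zero-initialized, just as the original += form already does:

#define M 64
#define N 64
#define P 64

void mul_ikj(int A[M][N], int B[N][P], int C[M][P])
{
    for (int i = 0; i < M; i++)
        for (int k = 0; k < N; k++) {
            int a = A[i][k];              /* reused across the whole inner loop */
            for (int j = 0; j < P; j++)
                C[i][j] += a * B[k][j];   /* consecutive elements of B and C */
        }
}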

With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
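
Below is a sketch of that blocked approach, assuming the same kind of array declarations as the question, with sizes and a block size chosen (as an assumption) so that the block divides them evenly; real implementations tune the block size to the cache and handle leftover edge blocks:

#define M 64
#define N 64
#define P 64
#define BLOCK 8

void mul_blocked(int A[M][N], int B[N][P], int C[M][P])
{
    for (int ii = 0; ii < M; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < P; jj += BLOCK)
                /* Multiply the BLOCK x BLOCK tile of A at (ii, kk) by the
                   tile of B at (kk, jj), accumulating into the tile of C at
                   (ii, jj); the small B tile stays in cache while it is reused. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        int a = A[i][k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i][j] += a * B[k][j];
                    }
}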

Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
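
In C, the usual way to give such a function the same freedom is the restrict qualifier, which is the caller's promise that the arrays do not overlap. A minimal sketch, using C99 variably modified parameter types (the function name and the i-k-j loop order are my choices, not something from the question):

void matmul(int m, int n, int p,
            const int a[restrict m][n],
            const int b[restrict n][p],
            int c[restrict m][p])
{
    /* restrict tells the compiler that a, b, and c do not alias, so it may
       reorder, interchange, and vectorize these loops more freely. */
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++) {
            int aik = a[i][k];
            for (int j = 0; j < p; j++)
                c[i][j] += aik * b[k][j];
        }
}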

There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Eric Postpischil
  • *it takes dozens of CPU cycles or more to fetch a value from memory*, yes, but multiple requests are pipelined. I guess you're taking this into account, or you would have said hundreds, though. If a single 3GHz core can sustain 25GB/s memory bandwidth (not unreasonable for a dual or quad-core Haswell), then that's 8 bytes per cycle for continuous sequential-read streams. – Peter Cordes Jan 27 '18 at 20:30
  • @PeterCordes: Sure, I just can’t wrap cache behavior, superscalar processing, SIMD, compiler abstraction, dependency analysis, and everything else into one answer. – Eric Postpischil Jan 27 '18 at 21:10
  • @Ped7g: I’ve been accused of writing an encyclopedia entry for [an answer](https://stackoverflow.com/a/12007422/298225). – Eric Postpischil Jan 28 '18 at 00:53