
I'm not sure where the best place to ask this is, but I am currently working with ARM intrinsics and am following this guide: https://developer.arm.com/documentation/102467/0100/Matrix-multiplication-example

However, the code there was written assuming that the arrays are stored in column-major order. I have always thought C arrays were stored row-major. Why did they assume this?

EDIT: For example, if instead of this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

They had done this:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int k_idx=0; k_idx < k; k_idx++) {
            for (int j_idx=0; j_idx < m; j_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

The code would run faster due to spatial locality of accessing C in the order C[0], C[1], C[2], C[3] instead of in the order C[0], C[2], C[1], C[3] (where C[0], C[1], C[2], C[3] are contiguous in memory).

  • Address space is linear, not two-dimensional; there are no "rows" and "columns" - there are just elements that are closer or further away. `Why did they assume this?` `We have assumed a column-major layout of the matrices in memory. That is, an n x m matrix M is represented as an array M_array where Mij = M_array[n*j + i].` They could do `n*j_idx + i_idx` or `j_idx + n*i_idx`. What is a column and what is a row is just your/their interpretation; you can flip the terms and be fine. – KamilCuk May 30 '21 at 17:04
  • @KamilCuk OK, I think I get what you mean. So I guess my question should be: why did they decide to implement the matrix multiply inefficiently, not accessing the arrays contiguously? – Very Confused May 30 '21 at 17:24
  • `matrix multiply inefficiently` I do not see a way to do it more efficiently... what do you mean? `not accessing the arrays contiguously` Because it's a "Matrix multiplication example" - so they access the array as if it's a matrix, and they do it in a specific fashion to perform a specific operation on it. A programming language is an abstraction - underneath, it's all machine instructions. An array becomes a matrix when you decide that `Mij = ...` – KamilCuk May 30 '21 at 17:32
  • @KamilCuk More efficiently as in doing the loop interchange between j and k shown in the edit. – Very Confused May 30 '21 at 18:17
  • The code you posted is invalid - `C[n*j_idx + i_idx]` is used while uninitialized when `j_idx = 1`, and it only gets zeroed later. I would first just do `memset(C, 0, ...)`. Anyway, the page says "Matrix multiplication __example__" - most certainly it's not meant to be efficient, but rather to show how it works. – KamilCuk May 30 '21 at 18:19
  • @KamilCuk OK, so I removed that initialization; let's say that C was already initialized to zero. Is it right to say that the loop interchange will make the code run faster? I'm not sure... Yes, I understand it's just an example, but I'm just wondering if the code could be optimised with the loop interchange. – Very Confused May 30 '21 at 18:27

2 Answers


You're not using C 2D arrays like C[i][j], so it's not a matter of how C stores anything; it's how 2D indexing is done manually in this code, using n * idx_1 + idx_2, with a choice of which index you loop over in the inner vs. outer loops.
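
As a tiny illustration (mine, not from the guide), the same buffer can be treated as a 2x3 matrix in either layout just by choosing the index formula:

// Hypothetical example: one 6-element buffer, two interpretations (n = 2 rows, m = 3 columns).
float M_array[6] = {1, 4, 2, 5, 3, 6};
// Column-major: element (i,j) is M_array[n*j + i]  -> the rows read {1, 2, 3} and {4, 5, 6}
// Row-major:    element (i,j) is M_array[m*i + j]  -> the rows read {1, 4, 2} and {5, 3, 6}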

But the hard part of a matmul with both matrices non-transposed is that you need to make opposite choices for the two input matrices: a naive matmul has to stride through distant elements of one of the input matrices, so it's inherently screwed. That's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. (O(n^3) work over O(n^2) data - you want to get the most use out of it for every time you bring it into L1d cache, and/or into registers.)

Loop interchange can speed things up to take advantage of spatial locality in the inner-most loop, if you do it right.

See the cache-blocked matmul example in What Every Programmer Should Know About Memory? which traverses contiguous memory in all 3 arrays in the inner few loops, picking the index that isn't scaled in any of the 3 matrices as the inner one. That looks like this:

  for (j_idx)
    for (k_idx)
      for (i_idx)
          C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];

Notice that B[k * j_idx + k_idx] is invariant over the inner loop, and that you're doing a simple dst[0..n] += const * src[0..n] operation over contiguous memory (which is easy to SIMD vectorize), although you're still doing 2 loads + 1 store for every FMA, so that's not going to max out your FP throughput.
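
Here is a minimal, self-contained sketch of that loop order (my code, not the guide's; the function name is made up, it keeps the guide's column-major indexing, and it assumes the caller has zeroed C):

#include <arm_neon.h>   // for float32_t, as in the guide; plain float works too
#include <stdint.h>

// Sketch: column-major matmul with the j,k,i loop order.
// A is n x k, B is k x m, C is n x m; C must be zeroed before the call.
void matrix_multiply_jki(const float32_t *A, const float32_t *B, float32_t *C,
                         uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t j_idx = 0; j_idx < m; j_idx++) {
        for (uint32_t k_idx = 0; k_idx < k; k_idx++) {
            float32_t b = B[k * j_idx + k_idx];   // invariant across the inner loop
            for (uint32_t i_idx = 0; i_idx < n; i_idx++) {
                // contiguous column of C += b * contiguous column of A
                C[n * j_idx + i_idx] += A[n * k_idx + i_idx] * b;
            }
        }
    }
}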

Separate from the cache access pattern, that also avoids a long dependency chain into a single accumulator (element of C). But that's not a real problem for an optimized implementation: you can of course use multiple accumulators. FP math isn't strictly associative because of rounding error, but multiple accumulators are closer to pairwise summation and likely to have less bad FP rounding error than serially adding each element of the row x column dot product. It will give different results from adding in the order the standard simple C loop does, but usually closer to the exact answer.
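
For example (a generic sketch of the multiple-accumulator idea, not ARM-specific and not from the guide), a dot product written with four accumulators breaks that serial dependency chain; the partial sums are combined at the end, so the rounding order differs from the single-accumulator loop:

#include <stddef.h>

// Sketch: 4 accumulators let 4 independent add/FMA chains be in flight at once.
// Assumes len is a multiple of 4 for brevity; a real version handles the tail.
float dot4(const float *x, const float *y, size_t len) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < len; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   // pairwise-style final combine
}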


Your proposed loop order i,k,j is the worst possible.

You're striding through distant elements of 2 of the 3 matrices in the inner loop, including discontiguous access to C[], opposite of what you said in your last paragraph.

With j as the inner-most loop, you'd access C[0], C[n], C[2n], etc. on the first outer iteration. And same for B[], so that's really bad.

Interchanging the i and j loops would give you contiguous access to C[] in the middle loop instead of strided, and still rows of one, columns of the other, in the inner-most loop. So that would be strictly an improvement: yes you're right that this naive example is constructed even worse than it needs to be.

But the key issue is the strided access to something in the inner loop: that's a performance disaster; that's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. The only index that is never used with a scale factor is i.
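
As a rough sketch of what that cache blocking can look like (my code, not from the guide; same column-major indexing, C pre-zeroed, tile size an untuned guess, and all dimensions assumed to be multiples of the tile size for brevity):

#include <arm_neon.h>
#include <stdint.h>

#define TILE 64   // tile size is an assumption; tune it for your cache sizes

// Sketch: blocked column-major matmul. Each (jj, kk, ii) iteration works on
// TILE x TILE tiles of A, B and C that can stay resident in cache.
void matmul_blocked(const float32_t *A, const float32_t *B, float32_t *C,
                    uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t jj = 0; jj < m; jj += TILE)
        for (uint32_t kk = 0; kk < k; kk += TILE)
            for (uint32_t ii = 0; ii < n; ii += TILE)
                // within a tile, keep the cache-friendly j,k,i order
                for (uint32_t j = jj; j < jj + TILE; j++)
                    for (uint32_t kx = kk; kx < kk + TILE; kx++) {
                        float32_t b = B[k * j + kx];
                        for (uint32_t i = ii; i < ii + TILE; i++)
                            C[n * j + i] += A[n * kx + i] * b;
                    }
}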

Peter Cordes

C is not inherently row-major or column-major.

When writing a[i][j], it's up to you to decide whether i is a row index or a column index.

While it's somewhat of a common convention to write the row index first (making the arrays row-major), nothing stops you from doing the opposite.

Also, remember that A × B = C is equivalent to Bᵀ × Aᵀ = Cᵀ (where ᵀ denotes the transpose), and reading a row-major matrix as if it were column-major (or vice versa) transposes it, meaning that if you want to keep your matrices row-major, you can just reverse the order of the operands.
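
A hypothetical sketch of that trick, assuming the guide's matrix_multiply_c (column-major, C = A × B with A being n × k and B being k × m) is in scope: a row-major matrix is byte-for-byte its own transpose stored column-major, so calling the routine with the operands swapped multiplies row-major data with no copying.

#include <arm_neon.h>
#include <stdint.h>

// Declaration of the guide's column-major routine (its definition is quoted in the question).
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k);

// Sketch: reuse the column-major routine for row-major matrices.
// Row-major A (n x k) is the same bytes as column-major A^T (k x n), and
// C = A*B  <=>  C^T = B^T * A^T, so pass B first and A second; the m x n
// column-major result is exactly the n x m row-major C we want.
void matrix_multiply_row_major(float32_t *A, float32_t *B, float32_t *C,
                               uint32_t n, uint32_t m, uint32_t k) {
    matrix_multiply_c(B, A, C, m, n, k);
}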

HolyBlackCat
  • Isn't it more efficient to sweep over a 2D array by incrementing the second index rather than the first index, due to how the array is stored in contiguous memory? – Very Confused May 30 '21 at 17:17
  • @VeryConfused Yes, it may be more efficient. But this has nothing to do with choosing which index is row and which is column. – HolyBlackCat May 30 '21 at 17:19
  • I see, so I guess my question should be why they decided to do the matrix multiply in a less efficient way by not accessing the arrays contiguously. – Very Confused May 30 '21 at 17:22
  • @VeryConfused That's not true. You can increment the second index in the inner loop regardless of whether it means rows or columns. Nobody said you must traverse matrices row by row instead of column by column. – HolyBlackCat May 30 '21 at 17:25
  • What about the code in the question edit - if a loop interchange was done, wouldn't the code run faster? – Very Confused May 30 '21 at 18:20
  • @VeryConfused You can measure it. – HolyBlackCat May 30 '21 at 18:28