
I have a matrix multiplication which looks like this:

```c
void gemm_nn(int N, int K, float *A, float *B, float *C) {
    int j, k;
    for (k = 0; k < K; k++)
        for (j = 0; j < N; j++)
            C[j] += A[k] * B[k * N + j];
}
```

The floats are single precision: 4 bytes, 32 bits.

I would like to optimize the loop for ARMv8-A (64-bit AArch64).

Could I load 4 consecutive floats into a single 128-bit register and do a single multiply-accumulate operation?

Could you point out the instructions I should try to achieve this?

gregoiregentil
    `B[k * N + j]` in the inner-most loop is what makes cache-blocking necessary for good performance. If your matrix isn't tiny, and you aren't trying to fold another computation into the matmul, see [How does BLAS get such extreme performance?](https://stackoverflow.com/q/1303182), and consider simply using an optimized BLAS library function, or Eigen. Related (and more links to matmul optimization stuff): [Matrix Multiplication of size 100\*100 using SSE Intrinsics](https://stackoverflow.com/a/47686673). SIMD loads of consecutive floats are not the hard part. – Peter Cordes Aug 14 '18 at 02:23

1 Answer


Yes. NEON `ld1 {vN.4s}` loads four consecutive single-precision floats into a 128-bit register, and `fmla vD.4s, vA.4s, vB.4s` performs a fused multiply-accumulate across all four lanes in one instruction, which is exactly what this loop needs.
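As a rough sketch, the same effect is usually easiest to get through the ACLE intrinsics rather than hand-written assembly: `vld1q_f32` maps to `ld1 {v.4s}`, `vfmaq_f32` maps to `fmla v.4s`, and `vdupq_n_f32` broadcasts the scalar `A[k]` into all four lanes. The multiple-of-4 handling and the scalar fallback below are my additions, not part of the original question:

```c
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* For each k, broadcast A[k] and multiply-accumulate 4 floats of B
   into 4 floats of C per iteration; a scalar tail handles N % 4.
   On non-NEON targets this falls back to the original scalar loop. */
void gemm_nn(int N, int K, const float *A, const float *B, float *C) {
    for (int k = 0; k < K; k++) {
#ifdef __ARM_NEON
        float32x4_t a = vdupq_n_f32(A[k]);          /* splat A[k] into all lanes */
        int j = 0;
        for (; j + 4 <= N; j += 4) {
            float32x4_t b = vld1q_f32(&B[k * N + j]); /* ld1 {v.4s} */
            float32x4_t c = vld1q_f32(&C[j]);
            c = vfmaq_f32(c, a, b);                 /* fmla: c += a * b, 4 lanes */
            vst1q_f32(&C[j], c);
        }
        for (; j < N; j++)                          /* scalar tail for N % 4 */
            C[j] += A[k] * B[k * N + j];
#else
        for (int j = 0; j < N; j++)                 /* portable scalar fallback */
            C[j] += A[k] * B[k * N + j];
#endif
    }
}
```

With `-O3` a recent GCC or Clang will often auto-vectorize the original scalar loop into the same `ld1`/`fmla` sequence, so it is worth checking the compiler output before hand-vectorizing.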

gregoiregentil