I'm hoping to speed up this matrix-vector product using AVX1 or earlier instructions:
    // a is an array of N column pointers; each column holds M elements
    // b is length N
    // c is length M
    //
    // M % 32 == N % 32 == 0 (M and N are constants known at compile time)
    // all memory is nicely aligned and unaliased
    void mat_vec_prod(const int8_t** a, const uint8_t* b, int16_t* c) {
        for (int i = 0; i < M; ++i) {
            c[i] = 0;
            for (int j = 0; j < N; ++j)
                c[i] += int16_t(a[j][i]) * int16_t(b[j]);
        }
    }
(I'm aware that swapping the loops is worth considering)
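For concreteness, this is the swapped-loop variant I mean; the arithmetic is identical, but the inner loop now reads a single column a[j] contiguously, at the cost of sweeping c once per column:

    // Swapped-loop variant (same result): the inner loop walks one column
    // a[j] front to back, which suits the column-pointer layout.
    void mat_vec_prod_swapped(const int8_t** a, const uint8_t* b, int16_t* c) {
        for (int i = 0; i < M; ++i)
            c[i] = 0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < M; ++i)
                c[i] += int16_t(a[j][i]) * int16_t(b[j]);
    }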
The intrinsics _mm_maddubs_epi16 and _mm_maddubs_pi16 could help with the uint8 x int8 dot product, but in my case the matrix has an awkward layout: it's an array of pointers to columns (instead of rows).
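To show why the layout is the obstacle: if a row of a were contiguous (the a_row pointer below is hypothetical), _mm_maddubs_epi16 would map onto the dot product directly. Note that it takes the *unsigned* bytes as its first operand, so b goes first and the matrix bytes second:

    #include <stdint.h>
    #include <tmmintrin.h>  // SSSE3 for _mm_maddubs_epi16

    // Hypothetical row-major inner kernel: a_row points at n contiguous int8
    // (n % 16 == 0, which N % 32 == 0 guarantees).
    int16_t row_dot(const int8_t* a_row, const uint8_t* b, int n) {
        __m128i acc = _mm_setzero_si128();
        for (int j = 0; j < n; j += 16) {
            __m128i vb = _mm_load_si128((const __m128i*)(b + j));      // 16 x uint8
            __m128i va = _mm_load_si128((const __m128i*)(a_row + j));  // 16 x int8
            acc = _mm_add_epi16(acc, _mm_maddubs_epi16(vb, va));       // 8 x int16 partial sums
        }
        // horizontal sum of the 8 int16 lanes; this wraps like the scalar
        // int16_t code (and maddubs itself saturates its pairwise add)
        acc = _mm_add_epi16(acc, _mm_srli_si128(acc, 8));
        acc = _mm_add_epi16(acc, _mm_srli_si128(acc, 4));
        acc = _mm_add_epi16(acc, _mm_srli_si128(acc, 2));
        return (int16_t)_mm_extract_epi16(acc, 0);
    }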
One possibility is to load 8x8 patches of a, transpose them, and multiply them by segments of b (I found this thread on 8x8 byte matrix transpose). However, that would have to use _mm_maddubs_pi16, which has only half the throughput of _mm_maddubs_epi16.
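Roughly what I have in mind for the inner step of that 8x8 path; rows[k] is a hypothetical name for row k of the transposed patch, and the final reduction into c is left as a comment:

    #include <tmmintrin.h>  // SSSE3 also declares the __m64 form _mm_maddubs_pi16

    // Hypothetical 8x8 inner step: rows[k] holds 8 signed bytes of output
    // row i+k (produced by the 8x8 byte transpose), vb the matching 8
    // unsigned bytes of b. The unsigned operand goes first here too; each
    // call leaves 4 partial int16 sums per output element in acc[k].
    void patch_step(const __m64 rows[8], __m64 vb, __m64 acc[8]) {
        for (int k = 0; k < 8; ++k)
            acc[k] = _mm_add_pi16(acc[k], _mm_maddubs_pi16(vb, rows[k]));
        // after the last patch along a row block: horizontally add the four
        // int16 lanes of each acc[k] into c[i+k], then call _mm_empty()
    }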
My question is: is it worth trying to load and transpose 16x16 patches instead, or will I run out of xmm
registers? What should my strategy be here?