I'm working on a matrix-vector multiply-and-accumulate function for a neural net, and I've finally decided to vectorize the whole thing manually instead of relying on autovectorization.
I came up with this function:
#include <immintrin.h>

#define re *restrict // just a shorthand

// computes a2[n2] = w[n2][n1]*a1[n1] + b[n2]
void l_frw(const int n2, const int n1, float re a2, const float re a1,
           const float w[restrict][n1], const float re b)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    for (int i = 0; i < n2; i++)
    {
        a2[i] = b[i];
        __m256 z = _mm256_setzero_ps();
        for (int j = 0; j < n1; j += 8)
        {
            __m256 x = _mm256_loadu_ps(&a1[j]);
            __m256 y = _mm256_loadu_ps(&w[i][j]);
            z = _mm256_fmadd_ps(x, y, z); // accumulate 8 partial products of row i's dot product
        }
        // dot product with all-ones = horizontal sum within each 128-bit lane,
        // broadcast to every element of that lane (0xFF instead of 0b11111111,
        // since binary literals are a GCC extension before C23)
        z = _mm256_dp_ps(z, one, 0xFF);
        a2[i] += z[0] + z[4]; // add the two lane sums; z[...] subscripting is a GNU vector extension
    }
}
(Yes, it only works when n1 is a multiple of 8.)
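One thing I already suspect is that _mm256_dp_ps is fairly expensive for what is in the end just a horizontal sum, so I've been thinking about swapping it for an explicit extract-and-add reduction, something like this (untested sketch; should give the same result as the dp_ps line followed by z[0]+z[4], and hsum256 is just a name I made up):

// untested sketch: horizontal sum of a __m256 without _mm256_dp_ps
// (relies on the same <immintrin.h> include as above)
static inline float hsum256(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);      // lower 128-bit lane
    __m128 hi = _mm256_extractf128_ps(v, 1);    // upper 128-bit lane
    __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // 2 partial sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); // total in element 0
    return _mm_cvtss_f32(s);
}

The loop epilogue would then just be a2[i] += hsum256(z);, which also avoids the GNU-extension subscripting. I haven't measured whether it actually wins, though.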
The version above is about 20% faster than the naive autovectorized version, which is pretty neat, but I'm still looking for improvements.
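One idea I've been toying with (untested so far) is splitting the accumulation across two independent registers, so consecutive FMAs don't serialize on the FMA latency chain (4-5 cycles on recent Intel cores). Roughly like this for the inner loop, assuming n1 is a multiple of 16:

// untested sketch: two independent accumulators to hide FMA latency
__m256 z0 = _mm256_setzero_ps();
__m256 z1 = _mm256_setzero_ps();
for (int j = 0; j < n1; j += 16)
{
    z0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a1[j]),     _mm256_loadu_ps(&w[i][j]),     z0);
    z1 = _mm256_fmadd_ps(_mm256_loadu_ps(&a1[j + 8]), _mm256_loadu_ps(&w[i][j + 8]), z1);
}
__m256 z = _mm256_add_ps(z0, z1); // combine once, then do the horizontal sum as before

Would that be worth extending to 4 or more accumulators, or am I likely memory-bound on the weight matrix anyway? Any other suggestions on how to speed this up?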