Wait a minute, is i or k in the inner loop? Assuming k is constant for all i, then broadcast A[k] into a whole vector, with _mm256_set1_pd(A[k]), and same for the other array[k] operands.
As Raymond says, that's way too complex for a single instruction. Even sin() isn't implemented in hardware (except for the scalar x87 version). Intel's intrinsics guide lists some Intel library functions that only Intel's SVML provides; they are not part of gcc / clang's <immintrin.h>.
Use an FMA (_mm256_fmadd_pd) for B[k] * C[i] + D[k], and pass that result to a vectorized sin() function, if you can find one. Feed that result into another FMA for the result[i] += A[k] * ... .
This of course takes two 32B vectors with AVX.
AVX512 does 64B vectors, but is only available in Xeon Phi accelerator cards for now.