I've just learned how to optimize GEMM with x86 vector registers. We were given matrices whose entries are 32-bit ints, and we can simply ignore overflow for simplicity.
There's _mm256_fmadd_pd for double-precision floating-point numbers to update the result C = AB + C, but for integers there seems to be no such FMA instruction. I first tried _mm256_mullo_epi32, which keeps only the low 32 bits of each product and so ignores overflow, followed by _mm256_add_epi32 to accumulate, like this:
#include <immintrin.h>
__m256i alpha = ...; // load something from memory
__m256i beta = ...;  // load something, too
// gamma is the __m256i accumulator holding part of C
gamma = _mm256_add_epi32(gamma, _mm256_mullo_epi32(alpha, beta));
// for double variables: gamma = _mm256_fmadd_pd(alpha, beta, gamma);
_mm256_storeu_epi32(..some place, gamma);
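For reference, a self-contained version of this multiply-then-add pattern could look like the sketch below. The function name `axpy_i32_avx2` and its signature are hypothetical, made up for illustration: it broadcasts one entry of A and updates a row of C as c[j] += a * b[j]. Two things differ from the snippet above and are assumptions on my part: the store uses `_mm256_storeu_si256` (plain AVX) in place of `_mm256_storeu_epi32` (an AVX-512VL intrinsic), and the `target("avx2")` attribute (GCC/Clang) is there only so the function builds without `-mavx2`.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: c[j] += a * b[j] for j in [0, n).
 * Products and sums wrap modulo 2^32, i.e. overflow is ignored. */
__attribute__((target("avx2")))
void axpy_i32_avx2(int32_t a, const int32_t *b, int32_t *c, size_t n)
{
    assert(b != NULL && c != NULL);

    /* Broadcast the scalar entry of A into all 8 lanes. */
    const __m256i alpha = _mm256_set1_epi32(a);

    size_t j = 0;
    for (; j + 8 <= n; j += 8) {
        __m256i beta  = _mm256_loadu_si256((const __m256i *)(b + j));
        __m256i gamma = _mm256_loadu_si256((const __m256i *)(c + j));
        /* Low 32 bits of each product, then a lane-wise wrapping add. */
        gamma = _mm256_add_epi32(gamma, _mm256_mullo_epi32(alpha, beta));
        _mm256_storeu_si256((__m256i *)(c + j), gamma);
    }

    /* Scalar tail; unsigned arithmetic gives the same wrapping result
     * without signed-overflow undefined behaviour. */
    for (; j < n; ++j) {
        c[j] = (int32_t)((uint32_t)c[j] + (uint32_t)a * (uint32_t)b[j]);
    }
}
```

In a real GEMM kernel the accumulator would normally stay in a register across the whole k loop and be stored once at the end, rather than being loaded and stored on every step as in this sketch.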
The lab server has a Cascade Lake Xeon(R) Gold 6226R, with GCC 7.5.0.
The Intel Intrinsics Guide tells me that mullo has a higher CPI than mul (nearly twice, with much higher latency), which surely affects performance. Are there any integer FMA instructions, or is there a better implementation for this case?