I am interested in implementing memory-aligned AVX/AVX2 vectorized multiplication for packed integers, floats & doubles.
As per the Intel Intrinsics Guide, for the Skylake architecture, _mm256_mul_ps and _mm256_mul_pd both have latency = 4 & throughput (CPI) = 0.5, while _mm256_mullo_epi32 has latency = 10 & throughput (CPI) = 1.
How exactly is the floating point vector multiplication faster than integer vector multiplication?
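For concreteness, here is a minimal sketch of the two kernels I am comparing (the function names are my own placeholders, and I assume n is a multiple of 8 and all pointers are 32-byte aligned):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// 8 packed floats per iteration via _mm256_mul_ps (Skylake: latency 4, CPI 0.5).
void mul_ps_kernel(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   // aligned 256-bit loads
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_mul_ps(va, vb));
    }
}

// 8 packed 32-bit integers per iteration via _mm256_mullo_epi32 (latency 10, CPI 1).
void mullo_epi32_kernel(const std::int32_t* a, const std::int32_t* b,
                        std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256i va = _mm256_load_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_load_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_store_si256(reinterpret_cast<__m256i*>(out + i),
                           _mm256_mullo_epi32(va, vb));
    }
}
```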
There exist _mm256_mul_epi32 and _mm256_mul_epu32, which both have latency = 5 & throughput (CPI) = 0.5. But as per << this SO answer >>, they need additional shuffle & unpack operations, whose AVX/AVX2 versions have latency = 1 and throughput (CPI) = 1.
As per my current, limited understanding, the throughput (CPI) of those extra shuffle and unpack operations may slow down the code by limiting the CPU's out-of-order execution. So (I may be wrong on this point) the lower latency of _mm256_mul_epi32 and _mm256_mul_epu32 is probably not enough to give a faster vectorized integer multiplication.
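To make that concrete, here is my own reconstruction of the shuffle-based alternative (the helper name is mine, and I have not verified it against the linked answer or measured it): it emulates an 8-lane 32-bit low multiply with two _mm256_mul_epu32 calls plus shifts and a blend.

```cpp
#include <immintrin.h>

// Hypothetical helper (my naming): emulate an 8-lane 32-bit low multiply
// using two _mm256_mul_epu32 calls instead of one _mm256_mullo_epi32.
static inline __m256i mullo32_via_mul_epu32(__m256i a, __m256i b) {
    // _mm256_mul_epu32 multiplies the even 32-bit lanes (0,2,4,6) of each
    // 64-bit element; the low 32 bits of each 64-bit product are the
    // results we want for those lanes.
    __m256i even = _mm256_mul_epu32(a, b);

    // Shift the odd lanes (1,3,5,7) down into the even positions and multiply.
    __m256i odd = _mm256_mul_epu32(_mm256_srli_epi64(a, 32),
                                   _mm256_srli_epi64(b, 32));

    // Move the odd-lane results back up into positions 1,3,5,7 ...
    __m256i odd_hi = _mm256_slli_epi64(odd, 32);

    // ... and interleave them with the even-lane results (mask 0xAA takes
    // lanes 1,3,5,7 from the second operand). The low 32 bits of a product
    // are the same for signed and unsigned inputs, so this also works as a
    // signed low multiply.
    return _mm256_blend_epi32(even, odd_hi, 0xAA);
}
```

Whether this is actually faster than a single _mm256_mullo_epi32 is exactly what I am unsure about, since it trades one higher-latency multiply for two multiplies plus shift and blend operations.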
I currently don't know how to properly benchmark such low-level code, so I haven't been able to profile it.
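The best I have come up with so far is a rough timing loop like the one below (the buffer size, repetition count, and checksum print are just my guesses at avoiding obvious pitfalls such as the compiler discarding the work or the data falling out of cache):

```cpp
#include <immintrin.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    // Small, 32-byte aligned buffers so the data stays resident in L1/L2.
    const std::size_t n = 4096;
    float* a = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* b = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    float* c = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) { a[i] = 1.5f; b[i] = 2.5f; }

    const int reps = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_load_ps(a + i);
            __m256 vb = _mm256_load_ps(b + i);
            _mm256_store_ps(c + i, _mm256_mul_ps(va, vb));
        }
    }
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Print a checksum so the compiler cannot discard the stores entirely.
    std::printf("%.4f ns/element, checksum %f\n",
                ns / (double(reps) * n), static_cast<double>(c[0]));

    std::free(a); std::free(b); std::free(c);
    return 0;
}
```

I would compile with something like -O2 -mavx2 and swap in the integer kernel to compare, but I don't know whether this methodology is sound (warm-up, frequency scaling, etc.), which is part of why I couldn't measure this myself.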
Is it possible to implement a vectorized 32-bit integer multiplication through SIMD intrinsics that is as fast as, or faster than, the vectorized float/double multiplication mentioned above? And if so, how do we do it?