
I am interested in implementing memory-aligned AVX/AVX2 vectorized multiplication for packed integers, floats, and doubles.

As per the Intel Intrinsics Guide, for the Skylake architecture,

  • _mm256_mul_ps and _mm256_mul_pd both have latency = 4 & throughput (CPI) = 0.5.
  • _mm256_mullo_epi32 has latency = 10 & throughput (CPI) = 1.
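
For concreteness, here is a minimal sketch of the two operations I am comparing, using aligned loads and stores (the `mul8_ps` / `mul8_epi32` helper names are mine, just for illustration):

```c
#include <immintrin.h>
#include <stdint.h>

/* Illustrative helpers (names are mine): element-wise multiply of
   8 packed floats / 8 packed 32-bit ints from 32-byte-aligned memory. */
static inline void mul8_ps(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_load_ps(a);               /* aligned load       */
    __m256 vb = _mm256_load_ps(b);
    _mm256_store_ps(out, _mm256_mul_ps(va, vb)); /* lat 4, CPI 0.5     */
}

static inline void mul8_epi32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_load_si256((const __m256i *)a);
    __m256i vb = _mm256_load_si256((const __m256i *)b);
    _mm256_store_si256((__m256i *)out,
                       _mm256_mullo_epi32(va, vb)); /* lat 10, CPI 1   */
}
```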

How exactly is the floating point vector multiplication faster than integer vector multiplication?

There also exist _mm256_mul_epi32 and _mm256_mul_epu32, which both have latency = 5 & throughput (CPI) = 0.5.

But as per this SO answer, they need shuffle and unpack operations, which have latency = 1 and throughput (CPI) = 1 for the AVX/AVX2 versions.
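
To make that concrete, here is a sketch of what such an emulation could look like; I used 64-bit shifts instead of shuffles/unpacks to move the odd elements into place, but the extra-uop cost is similar:

```c
#include <immintrin.h>

/* Sketch: emulate _mm256_mullo_epi32 using _mm256_mul_epu32.
   _mm256_mul_epu32 only multiplies the even 32-bit elements (0,2,4,6)
   of each input, producing four 64-bit products, so the odd elements
   must be moved into even positions and multiplied separately.
   (The low 32 bits of a product are the same for signed and unsigned,
   so epu32 is fine for a signed low multiply.) */
static inline __m256i mullo_epi32_emulated(__m256i a, __m256i b)
{
    /* products of even elements; the low 32 bits of each 64-bit lane
       are the results we want for the even positions */
    __m256i even = _mm256_mul_epu32(a, b);

    /* shift the odd elements down into the even positions and multiply */
    __m256i odd = _mm256_mul_epu32(_mm256_srli_epi64(a, 32),
                                   _mm256_srli_epi64(b, 32));

    /* recombine: keep the low 32 bits of the even products, move the
       low 32 bits of the odd products back up into the odd positions */
    __m256i lo_mask = _mm256_set1_epi64x(0x00000000FFFFFFFFull);
    return _mm256_or_si256(_mm256_and_si256(even, lo_mask),
                           _mm256_slli_epi64(odd, 32));
}
```

Note that this still issues two _mm256_mul_epu32 per vector, plus three shifts and the mask/combine, so it seems unlikely to beat a single _mm256_mullo_epi32.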

As per my current, limited understanding, the throughput (CPI) cost of the shuffle and unpack operations may slow the code down by getting in the way of the CPU's out-of-order execution.

So, though I may be wrong on this point, the advantage of using the lower-latency _mm256_mul_epi32 and _mm256_mul_epu32 is probably not enough to get a faster vectorized integer multiplication.

I currently don't know how to properly benchmark such low-level code, so I haven't been able to profile it.
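
For reference, the rough timing harness I would start from looks like the sketch below (assuming GCC/Clang on Linux, compiled with `-O2 -mavx2`; the `__attribute__((aligned(32)))` and the empty asm barrier are GCC extensions). I am aware a serious microbenchmark would also pin the thread, control frequency scaling, and separate latency from throughput:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>

#define N    1024     /* small enough to stay in L1 cache */
#define REPS 100000

static int32_t a[N] __attribute__((aligned(32)));
static int32_t b[N] __attribute__((aligned(32)));
static int32_t c[N] __attribute__((aligned(32)));

int main(void)
{
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = i + 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; ++r) {
        for (int i = 0; i < N; i += 8) {
            __m256i va = _mm256_load_si256((const __m256i *)&a[i]);
            __m256i vb = _mm256_load_si256((const __m256i *)&b[i]);
            _mm256_store_si256((__m256i *)&c[i],
                               _mm256_mullo_epi32(va, vb));
        }
        /* barrier: stop the compiler from collapsing the repetitions */
        __asm__ volatile("" ::: "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    /* print a result so the work cannot be optimized away entirely */
    printf("%.2f ns per %d-element pass (c[7] = %d)\n",
           ns / REPS, N, c[7]);
    return 0;
}
```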

Is it possible to implement a vectorized 32-bit integer multiplication through SIMD intrinsics that is as fast as, or faster than, the vectorized float/double multiplication mentioned above?

And if so, how do we do it?

  • First of all: Unlikely. But your question leaves a lot of room for clarification, most importantly, where do you get the operands and what do you do with the result? Is latency really an issue? What CPU are you running? The `_mm256_mul_epi32` is probably only worth it, if you want the 64bit result (or parts of that result, e.g., different overflow behavior). – chtz Sep 01 '19 at 21:17
  • Answer: No for both Intel and AMD. Neither of them have the execution units to do it. Each 64-bit SIMD lane has a 52-bit multiplier that can act as a pair of 23-bit multipliers by suppressing carry lines. Thus it can handle 1 x DP or 2 x SP multiply at full speed. But the same trick cannot be used to make 2 x 32-bit multipliers as the 52-bit multiplier isn't wide enough. Thus a 32-bit multiply must be passed into the full 52-bit multiply - limiting them to 1 x 32-bit for each 64-bit lane. Therefore the only way to do `_mm256_mullo_epi32` is to run it through the hardware twice. – Mysticial Sep 03 '19 at 19:31
