What is the best way to multiply the corresponding 32-bit elements of two __m256i
registers, treating each register as eight 32-bit integers?
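_mm256_mul_epu32 is not what I'm looking for: it multiplies only the low 32 bits of each 64-bit lane (32-bit elements 0, 2, 4 and 6) and returns four 64-bit products. I want eight 32-bit results, one for every pair of 32-bit input elements.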
Moreover, I know the products will never overflow 32 bits, so the low 32 bits of each product are all I need.
Thanks!