
I've just learned how to optimize GEMM with x86 vector registers. We were given matrices whose entries are 32-bit ints, and we simply ignore overflow to keep things simple. There is `_mm256_fmadd_pd` for double-precision floating point to update the result C = AB + C, but for integers there seems to be no such FMA instruction. I tried `_mm256_mullo_epi32` (ignoring overflow) and then `_mm256_add_epi32` to accumulate, like this:

#include <immintrin.h>

__m256i alpha = ...; // load 8 int32 elements of A from memory
__m256i beta  = ...; // load 8 int32 elements of B, too
__m256i gamma = ...; // the running int32 sums for C
gamma = _mm256_add_epi32( gamma, _mm256_mullo_epi32(alpha, beta) );
// for double variables, gamma = _mm256_fmadd_pd(alpha, beta, gamma);
_mm256_storeu_si256(..some place, gamma); // plain AVX2 store; _mm256_storeu_epi32 needs AVX-512VL

The lab server has a Cascade Lake Xeon(R) Gold 6226R, and we build with GCC 7.5.0. The Intel Intrinsics Guide tells me the `mullo` costs nearly twice the cycles per instruction of `mul`, with much higher latency, which surely affects performance. Is there any FMA instruction, or a better implementation, for this case?

  • There is `VPMADD52LUQ` (FMA for 52x52-bit integers), but not on Cascade Lake. You could just convert your matrices to `double` (on the fly while loading the data) and convert the result back to `int32` (see the first sketch after the comments). You may get unwanted overflow behavior. – chtz Aug 03 '21 at 08:51
  • Basically no, x86 SIMD doesn't have integer multiply-accumulate (MAC) for 32- or 64-bit integers, until AVX-512 IFMA52 exposes the mantissa capability of the SIMD FP FMA units directly. SSE/AVX does have multiply => horizontal-add for 16-bit => 32-bit (`pmaddwd`; see the second sketch after the comments), but nothing like that for 32-bit inputs. This mentions those: [Why is there no fused multiply-add for general-purpose registers on x86\_64 CPUs?](https://stackoverflow.com/q/49253907) (The question was about GPRs so it's not a great duplicate; I only noticed that after closing; will look for a better dup or maybe reopen) – Peter Cordes Aug 03 '21 at 09:31
  • Perhaps worth considering [Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?](https://stackoverflow.com/q/41403718), or simply using `double`, or `float` if it's precise enough. – Peter Cordes Aug 03 '21 at 09:34
  • @PeterCordes Thanks. Basically I now know x86 doesn't offer a single instruction that fits my requirement (maybe I should edit my title), but I still want a better implementation because my code above didn't perform as well as I expected. I hope chtz's suggestion of converting int to double may help (though the conversion overhead is another issue). – TimeOrange Aug 03 '21 at 09:52
  • I also forgot to say that "CPI" (I guess you mean throughput, assuming OoO exec can hide latency) for different instructions isn't something you can just add up; different uops can run on different ALU ports. Ideally you can do more work on your data while it's loaded into registers, although on your Skylake-derived CPU the 2 uops of `vpmulld` can run on either of p0 or p1, and `vpaddd` can run on any of the 3 vector ALU ports, so this load/mul/add code does have a good mix of uops to keep all 3 vector ALUs busy, assuming no memory bottlenecks (i.e. data hot in L2 cache at least). – Peter Cordes Aug 03 '21 at 09:58
  • @PeterCordes Many thanks. I'll test the performance again. Vectorization in GCC is a basic part of our HPC course. I'm not quite familiar with the architecture of modern CPUs (I just learned assembly language this semester). – TimeOrange Aug 03 '21 at 10:22
  • See [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) (which debunks the premise of that title), and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for a more advanced version. – Peter Cordes Aug 03 '21 at 10:24
  • You changed the title, but without knowing your precision requirements, and whether other code needs to access this data as integer, it's not really possible to answer. There aren't faster drop-in replacements that would give you bit-exact identical results. SIMD conversion to `double` for 2 inputs, and back for the result, is more costly. (And would mean only 4 elements per vector, not 8 for int32 or float). Conversion to `float` is worth considering, although that competes for FMA execution ports, and creates a longer dep chain for the `gamma` prefix sum accumulator. – Peter Cordes Aug 03 '21 at 23:20
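
A minimal sketch of chtz's convert-to-double suggestion, assuming AVX2 + FMA3 (compile with -mavx2 -mfma); the helper names `fma_i32_as_double` and `store_acc_as_i32` are made up for illustration. Each `__m256d` holds only 4 elements (vs. 8 for int32), and results round-trip exactly only while every intermediate fits in double's 53-bit mantissa.

#include <immintrin.h>
#include <stdint.h>

// Widen 4 int32 lanes from each input to double, then accumulate with
// the real floating-point FMA: acc = a*b + acc.
static inline __m256d fma_i32_as_double(const int32_t *a, const int32_t *b,
                                        __m256d acc)
{
    __m256d ad = _mm256_cvtepi32_pd(_mm_loadu_si128((const __m128i *)a));
    __m256d bd = _mm256_cvtepi32_pd(_mm_loadu_si128((const __m128i *)b));
    return _mm256_fmadd_pd(ad, bd, acc);
}

// Narrow the 4 double sums back to int32 (vcvtpd2dq rounds to nearest even).
static inline void store_acc_as_i32(int32_t *c, __m256d acc)
{
    _mm_storeu_si128((__m128i *)c, _mm256_cvtpd_epi32(acc));
}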
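
And a short illustration of the 16-bit multiply-add Peter Cordes mentions; `madd16_example` is a made-up name, and this path only applies when the inputs fit in int16:

#include <immintrin.h>

// vpmaddwd multiplies pairs of signed 16-bit elements and horizontally adds
// each adjacent pair into one 32-bit lane; there is no 32-bit-input
// equivalent short of AVX-512 IFMA52.
static inline __m256i madd16_example(__m256i a16, __m256i b16, __m256i acc32)
{
    __m256i prod = _mm256_madd_epi16(a16, b16); // 16x16 -> 32, pairwise sums
    return _mm256_add_epi32(acc32, prod);       // accumulate into int32 sums
}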

0 Answers