For example, `v4fmaddps` is an instruction for packed single-precision (32-bit) floating-point elements, but I want to multiply-accumulate 32-bit integers. Can I use `v4fmaddps` with packed 32-bit integers as input? Does this change the computation results?

  • `ps` instructions operate on IEEE754 floating-point bit-patterns (https://en.wikipedia.org/wiki/Single-precision_floating-point_format). If your integer inputs/outputs are small and non-negative (less than 2^24, so they're bit-patterns for subnormal floats), then the results will be the same, but it'll be extremely slow on most CPUs (a microcode assist on each instruction, on Intel at least), vastly more expensive than `vpmulld` / `vpaddd`. Otherwise you want AVX512-IFMA52, or convert your integers to floating-point. But it may be fastest just to use packed-integer multiply and add. – Peter Cordes Nov 28 '22 at 03:52
  • Integer multiply doesn't round the result; instead it keeps the low part and truncates the high half if the product is wider than the element width. So unlike FP, there's no precision advantage to an FMA for integers (aka a MAC, multiply-accumulate) vs. separate mul and add, only a hope to gain performance. – Peter Cordes Nov 28 '22 at 03:57
  • The small-integer trick would break in programs built with `-ffast-math`, which would set DAZ/FTZ and so treat the denormal inputs as zero, and flush denormal outputs to zero (to avoid a slow microcode assist). e.g. [Under what conditions does a C++ compiler use floating-point pipelines to do integer division with run-time-known values for higher performance?](https://stackoverflow.com/q/72087582) shows an example of using SIMD FP division to emulate integer SIMD division, which doesn't exist even in AVX-512. – Peter Cordes Nov 28 '22 at 04:00
  • Correction: no, addition "works" on subnormals as integers, but multiplication doesn't. `2 x 2^-130` times itself is a much tinier number that underflows to 0, not `4 x 2^-130`. So no, FP FMA instructions won't work at all without some kind of trickery. – Peter Cordes Nov 28 '22 at 04:12
  • But is it wasteful to use the `vpmadd52luq` instruction for 32-bit integer multiply-accumulate? My purpose is to multiply 32-bit integers and accumulate the results efficiently, and I didn't find instructions for a 32-bit integer "madd", only for 16-bit integers. Ideally I want to multiply and add the 16 32-bit integers residing in a zmm register. Since you say v4fmaddps is extremely slow on most CPUs, do you know any other AVX-512 instructions supporting this? – anna Nov 28 '22 at 11:06
  • Right, x86 doesn't have any integer MAC instructions except 52-bit in 64-bit elements, and the horizontal-pair ones for 8-bit and 16-bit elements. Probably your best bet is `vpmulld` / `vpaddd`, despite that costing 3 uops total per integer MAC (a sketch of that, and of the float-conversion option, appears after the comments). If you can leave your data as floating-point across multiple steps of computation, there might be something to gain from `vcvtdq2ps` so you can use `...ps` instructions. – Peter Cordes Nov 28 '22 at 17:13
  • https://en.wikichip.org/wiki/x86/avx512_vnni has `VPDPWSSD` to multiply-accumulate 16-bit inputs into 32-bit accumulators (so there's some horizontal adding). If your 32-bit integer multiplicands are actually only 16-bit values zero-extended (so the upper halves will do `0 * 0`), you can use this (see the sketch after the comments). – Peter Cordes Nov 28 '22 at 18:08
  • If your CPU happens to support KNCNI, there is [`_mm512_fmadd_epi32`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=3160&text=vpmadd) – chtz Nov 29 '22 at 00:30
  • Thanks! I think I will use vpmulld/vpaddd. – anna Nov 29 '22 at 02:06
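
The comments settle on `vpmulld` / `vpaddd` for a plain 32-bit integer multiply-accumulate. A minimal sketch with AVX-512F intrinsics (the function name `mac_epi32` and the loop structure are illustrative, not from the discussion):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// acc[i] += a[i] * b[i] for 16 lanes per iteration, wrapping mod 2^32.
// _mm512_mullo_epi32 (vpmulld) keeps the low 32 bits of each product and
// _mm512_add_epi32 (vpaddd) accumulates: about 3 uops per 16 MACs on
// typical Intel CPUs, as noted in the comments.
void mac_epi32(int32_t *acc, const int32_t *a, const int32_t *b, size_t n)
{
    // Assumes n is a multiple of 16 to keep the sketch short.
    for (size_t i = 0; i < n; i += 16) {
        __m512i va   = _mm512_loadu_si512(a + i);
        __m512i vb   = _mm512_loadu_si512(b + i);
        __m512i vacc = _mm512_loadu_si512(acc + i);
        __m512i prod = _mm512_mullo_epi32(va, vb);  // low half of 32x32 product
        vacc = _mm512_add_epi32(vacc, prod);        // integer accumulate
        _mm512_storeu_si512(acc + i, vacc);
    }
}
```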
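
If the data can stay in floating-point across several steps (the `vcvtdq2ps` option mentioned above) and exact wrap-around mod 2^32 isn't required, one FP FMA covers 16 elements. A sketch, exact only while the values involved stay small enough to be exactly representable in a float (below about 2^24 in magnitude):

```c
#include <immintrin.h>

// Convert 32-bit integers to float (vcvtdq2ps) and use an FMA (vfmadd...ps):
// returns acc + (float)a * (float)b per lane, with a single rounding.
__m512 mac_as_float(__m512 acc, __m512i a, __m512i b)
{
    __m512 fa = _mm512_cvtepi32_ps(a);
    __m512 fb = _mm512_cvtepi32_ps(b);
    return _mm512_fmadd_ps(fa, fb, acc);
}
```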
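
For the `VPDPWSSD` case, where the 32-bit inputs are really zero-extended 16-bit values, the AVX512-VNNI intrinsic does the whole MAC in one instruction (compile with VNNI enabled, e.g. `-mavx512vnni`; the wrapper name is illustrative):

```c
#include <immintrin.h>

// vpdpwssd: per 32-bit lane, acc += a_lo*b_lo + a_hi*b_hi, where lo/hi are
// the signed 16-bit halves of the lane. With the upper halves zero this is
// a plain 16x16 -> 32-bit multiply-accumulate.
__m512i mac_vnni(__m512i acc, __m512i a, __m512i b)
{
    return _mm512_dpwssd_epi32(acc, a, b);
}
```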

0 Answers